Abstract. Dissimilarity measures for (possibly weighted) phylogenetic trees based on the
comparison of their vectors of path lengths between pairs of taxa have been present in the
systematics literature since the early seventies. But, as far as rooted phylogenetic trees go,
these vectors can only separate non-weighted binary trees, and therefore these dissimilarity
measures are metrics only on this class. In this paper we overcome this problem by splitting,
in a suitable way, each path length between two taxa into two lengths. We prove that the
resulting splitted path lengths matrices single out arbitrary rooted phylogenetic trees with
nested taxa and arcs weighted in the set of positive real numbers. This allows the definition
of metrics on this general class by comparing these matrices by means of metrics in spaces
$\mathcal{M}_n(\mathbb{R})$ of real-valued $n \times n$ matrices. We conclude this paper by
establishing some basic facts about the metrics for non-weighted phylogenetic trees defined in
this way using $L^p$ metrics on $\mathcal{M}_n(\mathbb{R})$, with $p \in \mathbb{N} \setminus \{0\}$.

1 Introduction 

The exponential increase in the amount of available genomic and metagenomic data
has produced an explosion in the number of phylogenetic trees proposed by researchers:
according to Rokas [24], phylogeneticists are currently publishing an average of 15
phylogenetic trees per day. Many such trees are alternative phylogenies for the same sets
of organisms, obtained from different datasets or using different evolutionary models or
different phylogenetic reconstruction algorithms [16]. This variety of phylogenetic trees
makes it necessary to have methods for measuring the differences between phylogenetic
trees [13, Ch. 30], and the safest way to quantify these differences is by using
a metric, for which zero difference means isomorphism.

The comparison of phylogenetic trees is also used to assess the stability of reconstruction
methods [31], and it is essential to performing phylogenetic queries on databases [18].
Further, the need for comparing phylogenetic trees also arises in the comparative analysis
of clustering results obtained using different methods or different distance matrices, and
there is a growing interest in the assessment of clustering results in bioinformatics [15].
Recent applications of the comparison of phylogenetic-like trees also include the study
of the similarity between sequences, or sets of sequences, by measuring the difference
between their context trees [17]. In summary, and using the words of Steel and Penny [29],
tree comparison metrics are an important aid in the study of evolution.

Many metrics for phylogenetic tree comparison have been proposed so far, including
the Robinson-Foulds, or partition, metric [22,23], the nearest-neighbor interchange
metric [30], the subtree transfer distance [2], and the triples metric [9]. In the early
seventies, several researchers proposed dissimilarity measures for (possibly weighted) rooted
phylogenetic trees based on the comparison of the vectors of lengths of paths connecting
pairs of taxa. The aim of these measures was to quantify the rate at which pairs of
taxa that are close together in one tree lie at opposite ends in another tree [19]. These
authors defined the dissimilarity between a pair of trees as the Euclidean distance
between the corresponding vectors of path lengths [10,11], the Manhattan distance between
these vectors [31] or the correlation between these vectors [20]. Similar dissimilarity
measures have also been defined for unrooted phylogenetic trees [6,29]. Although different
names have been used for these dissimilarity measures (cladistic difference [10],
topological distance [20], path difference distance [29]), the term nodal distance seems to have
prevailed [6,21]. According to Steel and Penny [29], these measures have several interesting
features that make them deserve more study and consideration.

The theoretical basis for these nodal distances is Smolenskii's theorem [28], which establishes
that two unrooted phylogenetic trees $T$, $T'$ on the same set $S$ of taxa are isomorphic
if, and only if, for every pair of leaves $i, j$, the distances between $i$ and $j$ in $T$ and in
$T'$ are the same. This result was later expanded by Zaretskii [32], who characterized the
vectors of distances between pairs of leaves of an unrooted phylogenetic tree by means of
the well-known four-point condition. Smolenskii's and Zaretskii's papers were published
in Russian, which has contributed to the fact that their results have been rediscovered
and generalized many times [3,7,8,26]; for a modern textbook treatment of these results
in all their generality (weighted unrooted trees with nested taxa), see [25, Ch. 7], and
for a historical account, see [1].

Unfortunately, Smolenskii's theorem is not valid for arbitrary rooted phylogenetic
trees: there exist non-isomorphic rooted phylogenetic trees with the same path lengths
between pairs of leaves (see Figs. 1, 2, 3). It turns out that only the fully resolved, or
binary, non-weighted rooted phylogenetic trees are singled out by their path lengths
vectors, and therefore the nodal distances based on the comparison of these vectors are
metrics (more specifically, zero nodal distance means isomorphism) only on the space of
non-weighted binary phylogenetic trees. Although this result seems to have been known since
the time of the first proposals of nodal distances, we have not been able to find an explicit
proof in the literature, and thus, for the sake of completeness, we include a simple proof
of this fact in Section 3, reducing it to the general version of Smolenskii's result.

The main result of this paper is the definition of metrics on the space of arbitrary
rooted phylogenetic trees that generalize the nodal distances, where arbitrary means not
necessarily binary and with possibly nested taxa and arcs weighted in the set of positive
real numbers. To do that, we split each path between two taxa into the paths from their
least common ancestor to each taxon. In this way we associate to each rooted phylogenetic
tree with $n$ taxa an $n \times n$ matrix, with rows and columns indexed by the taxa, whose
$(i, j)$-entry contains the length of the path from the least common ancestor of the $i$-th
and $j$-th taxa to the $i$-th taxon. We prove that these splitted path lengths matrices
single out arbitrary rooted phylogenetic trees, and then we use them to define splitted
nodal metrics on the space of weighted rooted phylogenetic trees with nested taxa by
comparing these matrices through real-valued norms applied to their difference. We also
prove some basic properties of the splitted nodal metrics on the space of non-weighted
rooted phylogenetic trees obtained using the $L^p$ norms, with $p \in \mathbb{N} \setminus \{0\}$.

2 Notations and conventions 

A rooted tree is a non-empty directed finite graph that contains a distinguished node,
called the root, from which every other node can be reached through exactly one path.
An $A$-weighted rooted tree, with $A \subseteq \mathbb{R}$, is a pair $(T, \omega)$ consisting of a rooted tree
$T = (V, E)$ and a weight function $\omega : E \to A$ that associates to every arc $e \in E$ a real
number $\omega(e) \in A$. In this paper we shall only consider two sets $A$ of weights: the set of
non-negative real numbers $\mathbb{R}_{\geq 0} = \{t \in \mathbb{R} \mid t \geq 0\}$, and the set of positive real numbers
$\mathbb{R}_{>0} = \{t \in \mathbb{R} \mid t > 0\}$. When the set $A$ is irrelevant (for instance, in general definitions),
we shall omit it and simply talk about weighted, instead of $A$-weighted, trees. We identify
every non-weighted (that is, where no weight function has been explicitly defined) rooted
tree $T$ with the weighted rooted tree $(T, \omega)$ with $\omega$ the weight 1 constant function.

Let $T = (V, E)$ be a rooted tree. Whenever $(u, v) \in E$, we say that $v$ is a child of
$u$ and that $u$ is the parent of $v$. Every node in $T$ has exactly one parent, except the
root, which has no parent. The number of children of a node is its out-degree. The nodes
without children are the leaves of the tree, and the other nodes are called internal. An
arc $(u, v)$ is internal when its head $v$ is internal, and pendant when $v$ is a leaf. The
out-degree 1 nodes are called elementary. A tree is binary when all its internal nodes
have out-degree 2.

Given a path $(v_0, v_1, \ldots, v_k)$ in a rooted tree $T$, its origin is $v_0$, its end is $v_k$, and
its intermediate nodes are $v_1, \ldots, v_{k-1}$. Such a path is non-trivial when $k \geq 1$. We shall
represent a path from $u$ to $v$, that is, a path with origin $u$ and end $v$, by $u \rightsquigarrow v$. Whenever
there exists a (non-trivial) path $u \rightsquigarrow v$, we shall say that $v$ is a (non-trivial) descendant of
$u$ and also that $u$ is a (non-trivial) ancestor of $v$. If $v$ is a descendant of $u$, the path $u \rightsquigarrow v$
is unique. The distance from a node $u$ to a descendant $v$ of it in a weighted rooted tree
is the sum of the weights of the arcs forming the unique path $u \rightsquigarrow v$; in a non-weighted
rooted tree, this distance is simply the number of arcs of this path. The depth of a node
$v$, in symbols $\mathrm{depth}_T(v)$, is the distance from the root to $v$.

The least common ancestor (LCA) of a pair of nodes $u, v$ of a rooted tree $T$, in
symbols $[u, v]_T$, is the unique common ancestor of them that is a descendant of every
other common ancestor of them. Alternatively, it is the unique common ancestor of $u, v$
such that the paths from it to $u$ and $v$ have only their origin in common. In particular,
if one of the nodes, say $u$, is an ancestor of the other, then $[u, v]_T = u$.

Let $S$ be a non-empty finite set of labels, or taxa. A (weighted) phylogenetic tree on
$S$ is a (weighted) rooted tree with some of its nodes, including all its leaves and its
elementary nodes, bijectively labeled in the set $S$. In such a phylogenetic tree, we shall
always identify, usually without any further mention, a labeled node with its taxon. The
internal labeled nodes of a phylogenetic tree are called nested taxa.

Two phylogenetic trees $T$ and $T'$ on the same set $S$ of taxa are isomorphic when they
are isomorphic as directed graphs and the isomorphism sends each labeled node of $T$ to
the labeled node with the same label in $T'$; an isomorphism of weighted phylogenetic
trees is also required to preserve arc weights. As usual, we shall use the symbol $\cong$ to
denote the existence of an isomorphism.

Although our main objects of study are weighted phylogenetic trees, which are
rooted trees, unrooted trees will also appear in the next section. An
unrooted tree is an undirected finite graph where every pair of nodes is connected by
exactly one path. An $A$-weighted unrooted tree is a pair $(T, \omega)$ consisting of an unrooted
tree $T = (V, E)$ and a weight function $\omega : E \to A$. The distance between two nodes in a
weighted unrooted tree is the sum of the weights of the edges forming the unique path
that connects these nodes.

An unrooted tree is partially labeled in a set S when some of its nodes are bijectively 
labeled in the set S. An unrooted S-tree is an unrooted tree partially labeled in S with 
all its leaves and all its nodes of degree 2 labeled. 

Given a phylogenetic tree $T = (V, E)$ on $S$, its unrooted version is the unrooted tree
$T^u = (V, E^u)$ partially labeled in $S$ obtained by replacing each arc $(u, v) \in E$ by an edge
$\{u, v\} \in E^u$ and keeping the labels.

The notion of isomorphism for (possibly weighted) partially labeled unrooted trees
is similar to the notion given in the rooted case. Notice that if $T_1 = (V_1, E_1)$ and $T_2 =
(V_2, E_2)$ are two phylogenetic trees on the same set $S$ of taxa, with roots $r_1$ and $r_2$,
respectively, then a mapping $f : V_1 \to V_2$ is an isomorphism between $T_1$ and $T_2$ if, and
only if, it is an isomorphism between $T_1^u$ and $T_2^u$ and $f(r_1) = r_2$.

3 Path lengths separate non-weighted binary phylogenetic trees 

Let $T$ be an $\mathbb{R}_{\geq 0}$-weighted phylogenetic tree on the set $S = \{1, \ldots, n\}$. For every $i, j \in S$,
let $\ell_T(i, j)$ and $\ell_T(j, i)$ denote the distances from $[i, j]_T$ to $i$ and $j$, respectively. The path
length between two labeled nodes $i$ and $j$ is

$$L_T(i, j) = \ell_T(i, j) + \ell_T(j, i).$$

Definition 1. The path lengths vector of $T$ is the vector

$$L(T) = \big(L_T(i, j)\big)_{1 \leq i < j \leq n} \in \mathbb{R}^{n(n-1)/2}$$

with its entries ordered lexicographically in $(i, j)$.
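
As an illustration of Definition 1, the following minimal sketch computes a path lengths vector, taking the path length between $i$ and $j$ as the sum of the distances from their LCA to each of them. The tree encoding (a dictionary `parent` mapping each non-root node to a pair `(parent, arc weight)`, with taxa used directly as node names) and the function names are assumptions made for this example, not notation from the paper.

```python
from itertools import combinations

def ancestor_dists(parent, v):
    """Map every ancestor of v (v included) to its distance down to v."""
    out, d = {v: 0}, 0
    while v in parent:              # the root has no entry in `parent`
        v, w = parent[v]
        d += w
        out[v] = d
    return out

def path_lengths_vector(parent, n):
    """Path lengths L_T(i, j) for 1 <= i < j <= n, in lexicographic order."""
    vec = []
    for i, j in combinations(range(1, n + 1), 2):
        ai, aj = ancestor_dists(parent, i), ancestor_dists(parent, j)
        lca = next(v for v in ai if v in aj)   # first common ancestor going up from i
        vec.append(ai[lca] + aj[lca])
    return vec

# Hypothetical non-weighted binary tree ((1,2),3): "x" and "r" are unlabeled nodes.
parent = {1: ("x", 1), 2: ("x", 1), "x": ("r", 1), 3: ("r", 1)}
print(path_lengths_vector(parent, 3))  # [2, 3, 3]
```

This walks up to the root for every pair, so it is a readable sketch rather than an optimized procedure.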

These path lengths vectors have been used since the early seventies to compare
non-weighted binary phylogenetic trees [10,20,31], but we have not been able to find an
explicit proof in the literature of the fact that this kind of phylogenetic tree can be
singled out by means of its path lengths vector. For the sake of completeness, we
provide here a simple proof of this fact, derived from Smolenskii's theorem [28], which
establishes that the vector of distances between pairs of labeled nodes characterizes up
to isomorphism an $\mathbb{R}_{>0}$-weighted unrooted $S$-tree; see also Thm. 7.1.8 in [25].






Proposition 1. Two non-weighted binary phylogenetic trees on the same set S of taxa 
are isomorphic if, and only if, they have the same path lengths vectors. 

Proof. The 'only if' implication is obvious. As far as the 'if' implication goes, let $T_1$ and
$T_2$ be two non-weighted binary phylogenetic trees on the same set $S$ with the same path
lengths vectors. If $|S| = 1$, the equivalence in the statement is obvious, because every
phylogenetic tree with only one labeled node consists only of one node. So we assume
henceforth that $|S| \geq 2$.

For every $t = 1, 2$, let $(T_t^*, \omega_t)$ be the $\mathbb{R}_{>0}$-weighted unrooted $S$-tree defined as follows:

- If the root of $T_t$ is labeled, then $T_t^* = T_t^u$ and all edges of $T_t^*$ have weight 1.

- If the root $r_t$ of $T_t$ is not labeled, and if $u_t, v_t$ are the children of $r_t$, then $T_t^*$ is
obtained from $T_t^u$ by removing the node $r_t$ and replacing the edges $\{r_t, u_t\}, \{r_t, v_t\}$
by a single edge $\{u_t, v_t\}$, and then all edges of $T_t^*$ have weight 1, except $\{u_t, v_t\}$,
which has weight 2.

It is straightforward to check that such a $T_t^*$ is always an unrooted $S$-tree: the root $r_t$ of
$T_t$ is the only degree 2 node in $T_t^u$ and then, if it is labeled, $T_t^u$ is an unrooted $S$-tree,
and if it is not labeled, we remove it in the construction of $T_t^*$ without modifying the
degrees of the remaining nodes. Moreover, it is also obvious from the construction that
the distance between any pair of labeled nodes in $T_t^*$ is equal to the path length between
these nodes in $T_t$. In particular, $T_1^*$ and $T_2^*$ have the same distances between each pair
of labeled nodes. Then, by [25, Thm. 7.1.8],
$T_1^* \cong T_2^*$ as weighted unrooted $S$-trees.

It remains to check that this isomorphism induces an isomorphism of phylogenetic
trees $T_1 \cong T_2$. To do it, notice that, since the isomorphism between $T_1^*$ and $T_2^*$ preserves
edge weights, there are only two possibilities:

- All edges in $T_1^*$ and $T_2^*$ have weight 1. In this case $T_1^* = T_1^u$ and $T_2^* = T_2^u$, and the
isomorphism $T_1^u \cong T_2^u$ sends the root of $T_1$ to the root of $T_2$, because they are the
only degree 2 nodes in $T_1^u$ and $T_2^u$. Therefore, it induces an isomorphism $T_1 \cong T_2$.

- Both $T_1^*$ and $T_2^*$ have one weight 2 edge, say $\{u_1, v_1\}$ and $\{u_2, v_2\}$, respectively. Then
each $T_t^u$ is obtained from $T_t^*$ by adding the root $r_t$ of $T_t$ and splitting the edge
$\{u_t, v_t\}$ into two edges $\{u_t, r_t\}$ and $\{v_t, r_t\}$. Since the isomorphism $T_1^* \cong T_2^*$ sends
$\{u_1, v_1\}$ to $\{u_2, v_2\}$, its extension to a mapping $V_1 \to V_2$ by sending $r_1$ to $r_2$ defines
an isomorphism $T_1^u \cong T_2^u$ that sends the root of $T_1$ to the root of $T_2$, and hence an
isomorphism $T_1 \cong T_2$. □
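
The construction of the weighted unrooted tree used in the proof above can be sketched as follows, for a non-weighted binary tree with at least two taxa. The encoding (a dictionary `parent` from each non-root node to its parent, and a set `taxa` of labeled nodes) and the name `unrooted_star` are hypothetical choices for this illustration.

```python
def unrooted_star(parent, taxa):
    """Edge weights {frozenset({u, v}): w} of the unrooted tree built in the proof.

    `parent` maps each non-root node to its parent (all arcs have weight 1);
    `taxa` is the set of labeled nodes. Assumes a binary tree with >= 2 taxa.
    """
    edges = {frozenset((v, p)): 1 for v, p in parent.items()}
    root = next(p for p in parent.values() if p not in parent)
    if root not in taxa:
        # Unlabeled root: suppress it and join its two children by a weight 2 edge.
        u, v = [c for c, p in parent.items() if p == root]
        del edges[frozenset((u, root))], edges[frozenset((v, root))]
        edges[frozenset((u, v))] = 2
    return edges

# Hypothetical tree ((1,2),3) with unlabeled internal node "x" and unlabeled root "r".
parent = {1: "x", 2: "x", "x": "r", 3: "r"}
star = unrooted_star(parent, taxa={1, 2, 3})
print(star[frozenset(("x", 3))])  # 2
```

In this small example the distances in the resulting unrooted tree (2 between taxa 1 and 2, and 3 between taxon 3 and either of the others) agree with the path lengths in the rooted tree, as the proof requires.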

Let $\mathcal{BT}_n$ be the class of all non-weighted binary phylogenetic trees on $S = \{1, \ldots, n\}$.
The injectivity up to isomorphisms of the mapping

$$L : \mathcal{BT}_n \to \mathbb{R}^{n(n-1)/2}, \qquad T \mapsto L(T)$$

makes the classical definitions of nodal metrics on $\mathcal{BT}_n$ induced by metrics on $\mathbb{R}^{n(n-1)/2}$
yield, indeed, metrics. For example, recall that the $L^p$ norm on $\mathbb{R}^m$ is defined as

$$\|(x_1, \ldots, x_m)\|_p = \begin{cases}
\big|\{i \mid i = 1, \ldots, m,\ x_i \neq 0\}\big| & \text{if } p = 0 \\
\sqrt[p]{\sum_{i=1}^m |x_i|^p} & \text{if } p \in \mathbb{N}^+ \\
\max\{|x_i| \mid i = 1, \ldots, m\} & \text{if } p = \infty
\end{cases}$$

where, here and henceforth, $\mathbb{N}^+$ stands for $\mathbb{N} \setminus \{0\}$. Each $L^p$ norm on $\mathbb{R}^{n(n-1)/2}$ induces
then a metric on $\mathcal{BT}_n$ through the formula

$$d_p(T_1, T_2) = \|L(T_1) - L(T_2)\|_p.$$
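
The norm-induced distance just described can be sketched in a few lines, covering the three cases $p = 0$, $p \in \mathbb{N}^+$ and $p = \infty$. The function names and the two example vectors are assumptions for illustration; the vectors are meant as the path lengths vectors of the hypothetical trees ((1,2),3) and (1,(2,3)) with unit arc weights.

```python
import math

def lp_norm(x, p):
    """L^p 'norm' of a vector: number of non-zero entries (p = 0),
    p-th root of the sum of p-th powers (p >= 1), or largest entry (p = inf)."""
    if p == 0:
        return sum(1 for xi in x if xi != 0)
    if p == math.inf:
        return max(abs(xi) for xi in x)
    return sum(abs(xi) ** p for xi in x) ** (1 / p)

def nodal_distance(L1, L2, p):
    """d_p(T1, T2) = ||L(T1) - L(T2)||_p for two path lengths vectors."""
    return lp_norm([a - b for a, b in zip(L1, L2)], p)

# Assumed path lengths vectors of ((1,2),3) and (1,(2,3)), unit weights.
L1, L2 = [2, 3, 3], [3, 3, 2]
print(nodal_distance(L1, L2, 1))         # 2.0
print(nodal_distance(L1, L2, math.inf))  # 1
```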

Some of these metrics have been present in the literature since the early seventies. For
instance, Farris [10] introduced the metric on $\mathcal{BT}_n$ induced by the $L^2$, or Euclidean,
norm on $\mathbb{R}^{n(n-1)/2}$:

$$d_2(T_1, T_2) = \sqrt{\sum_{1 \leq i < j \leq n} \big(L_{T_1}(i, j) - L_{T_2}(i, j)\big)^2}$$

(he called it cladistic difference), while Williams and Clifford [31] proposed the metric
on $\mathcal{BT}_n$ induced by the $L^1$, or Manhattan, norm on $\mathbb{R}^{n(n-1)/2}$:

$$d_1(T_1, T_2) = \sum_{1 \leq i < j \leq n} \big|L_{T_1}(i, j) - L_{T_2}(i, j)\big|.$$
Unfortunately, the path lengths vectors cannot be used to separate phylogenetic trees
in much more general classes than the one considered in the previous proposition. For
instance, they do not single out phylogenetic trees with nodes of out-degree greater
than 2 (see Fig. 1), phylogenetic trees with (labeled) elementary nodes (see Fig. 2),
or weighted binary phylogenetic trees with weights different from 1 (see Fig. 3).
Therefore, no metric for general phylogenetic trees can be derived from path lengths
alone. We overcome this problem in the next section.
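
To make the failure concrete, here is a small sketch with two pairs of non-isomorphic trees that share their path lengths. These pairs are hypothetical examples constructed for this illustration (they are not claimed to be the trees of Figs. 2 and 3), and the encoding of a tree as a dictionary from each non-root node to `(parent, arc weight)` is likewise an assumption.

```python
def ancestor_dists(parent, v):
    """Map every ancestor of v (v included) to its distance down to v."""
    out, d = {v: 0}, 0
    while v in parent:
        v, w = parent[v]
        d += w
        out[v] = d
    return out

def path_length(parent, i, j):
    """Sum of the distances from the LCA of i and j to i and to j."""
    ai, aj = ancestor_dists(parent, i), ancestor_dists(parent, j)
    lca = next(v for v in ai if v in aj)
    return ai[lca] + aj[lca]

# Weighted binary trees: unlabeled root "r" with leaves 1 and 2, weights swapped.
t_a = {1: ("r", 1), 2: ("r", 2)}
t_b = {1: ("r", 2), 2: ("r", 1)}
# Trees with a labeled elementary root: 1 above 2, and 2 above 1.
t_c = {2: (1, 1)}
t_d = {1: (2, 1)}

print(path_length(t_a, 1, 2), path_length(t_b, 1, 2))  # 3 3
print(path_length(t_c, 1, 2), path_length(t_d, 1, 2))  # 1 1
```

In each pair no label-preserving, weight-preserving isomorphism exists, yet the single path length coincides.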




Fig. 1. Two non-isomorphic non-binary phylogenetic trees $T$ and $T'$ with the same path lengths vectors.

Fig. 2. Two non-isomorphic phylogenetic trees with an elementary node and the same path lengths
vectors.

Fig. 3. Two non-isomorphic $\mathbb{R}_{>0}$-weighted binary phylogenetic trees with the same path lengths vectors.

Remark 1. Let $T$ be a non-weighted binary phylogenetic tree on a set $S$ of taxa. Since
the path lengths vector $L(T)$ is the vector of distances of a (possibly weighted) unrooted
$S$-tree (see the proof of Proposition 1), it is well-known (see, for instance, Lem. 7.1.7
in [25]) that it satisfies the four-point condition: for every $a, b, c, d \in S$,

$$L_T(a, b) + L_T(c, d) \leq \max\{L_T(a, c) + L_T(b, d),\ L_T(a, d) + L_T(b, c)\}.$$

Zaretskii's theorem [32] establishes that any dissimilarity measure on $S$ satisfying this
four-point condition is given by the distances between labeled nodes in an $\mathbb{R}_{>0}$-weighted
unrooted $S$-tree (see Thm. 7.2.6 in [25]). But, to our knowledge, it is not known what
extra properties should be required of such a dissimilarity measure on $S$ to guarantee
that it is given by the path lengths between labeled nodes in a non-weighted binary
phylogenetic tree.

4 Splitted path lengths separate arbitrary phylogenetic trees

Let $(T, \omega)$, with $T = (V, E)$, be again an $\mathbb{R}_{\geq 0}$-weighted phylogenetic tree on $S =
\{1, \ldots, n\}$ and, for every $i, j \in S$, let $\ell_T(i, j)$ and $\ell_T(j, i)$ still denote the distances from
$[i, j]_T$ to $i$ and $j$, respectively.

Definition 2. The splitted path lengths matrix of $T$ is the $n \times n$ square matrix over $\mathbb{R}_{\geq 0}$

$$\ell(T) = \begin{pmatrix}
\ell_T(1, 1) & \ell_T(1, 2) & \ldots & \ell_T(1, n) \\
\ell_T(2, 1) & \ell_T(2, 2) & \ldots & \ell_T(2, n) \\
\vdots & \vdots & \ddots & \vdots \\
\ell_T(n, 1) & \ell_T(n, 2) & \ldots & \ell_T(n, n)
\end{pmatrix} \in \mathcal{M}_n(\mathbb{R}_{\geq 0}).$$

Notice that this matrix need not be symmetrical (see the next example), but all entries
$\ell_T(i, i)$ in its main diagonal are 0.

The splitted path lengths matrix $\ell(T)$ of a tree $T \in \mathcal{T}_n$ can be computed in optimal
$O(n^2)$ time, by computing by breadth-first search for each internal node of $T$ the distance
to each one of its descendant taxa and the pairs of taxa of which it is the LCA.
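
A direct way to obtain the matrix is sketched below. It is a simplified illustration, not the breadth-first procedure mentioned above: it walks from each taxon up to the root, so its cost also depends on the height of the tree. The encoding of the tree (a dictionary `parent` from each non-root node to `(parent, arc weight)`, with taxa 1, ..., n used as node names) and the example trees are assumptions.

```python
def ancestor_dists(parent, v):
    """Map every ancestor of v (v included) to its distance down to v."""
    out, d = {v: 0}, 0
    while v in parent:
        v, w = parent[v]
        d += w
        out[v] = d
    return out

def splitted_matrix(parent, n):
    """Matrix whose (i, j) entry is the distance from the LCA of i and j to i."""
    anc = {i: ancestor_dists(parent, i) for i in range(1, n + 1)}
    M = [[0] * n for _ in range(n)]
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            lca = next(v for v in anc[i] if v in anc[j])
            M[i - 1][j - 1] = anc[i][lca]
    return M

# Hypothetical weighted trees: unlabeled root "r" with leaves 1 and 2, weights swapped.
t_a = {1: ("r", 1), 2: ("r", 2)}
t_b = {1: ("r", 2), 2: ("r", 1)}
print(splitted_matrix(t_a, 2))  # [[0, 1], [2, 0]]
print(splitted_matrix(t_b, 2))  # [[0, 2], [1, 0]]
```

Both example trees have path length 3 between taxa 1 and 2, but their matrices differ, and neither matrix is symmetric.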

Example 1. The splitted path lengths matrices of the trees T and T' depicted in Fig. 1 
are 



The splitted path lengths matrices of the trees T and T' depicted in Fig. 2 are 



The splitted path lengths matrices of the weighted trees $T$ and $T'$ depicted in Fig. 3 are



This example shows that the splitted path lengths matrices can separate pairs of
phylogenetic trees that could not be separated by means of their path lengths vectors.
Our main result in this section states that these matrices characterize arbitrary
$\mathbb{R}_{>0}$-weighted phylogenetic trees. To prove it, it is convenient to establish first some lemmas,
and to recall a result from [14].

Lemma 1. Let $T$ be an $\mathbb{R}_{>0}$-weighted phylogenetic tree on $S$. A label $i \in S$ is a nested
taxon of $T$ if, and only if, $\ell_T(i, j) = 0$ for some $j \neq i$.

Proof. If an internal node of $T$ is labeled with $i$, then taking as $j \in S$ any descendant
leaf of $i$ we have that $[i, j]_T = i$ and hence $\ell_T(i, j) = 0$. Conversely, if $\ell_T(i, j) = 0$, then
$[i, j]_T = i$ and therefore the node $i$ is an ancestor of the node $j$. If $i \neq j$, this can only
happen if $i$ is internal. □

Lemma 2. Let $T$ be an $\mathbb{R}_{>0}$-weighted phylogenetic tree on $S$. For every $i \in S$, consider
the set of weights

$$W_i = \{\ell_T(i, j) \mid j \in S,\ \ell_T(i, j) > 0\}.$$

(a) $W_i = \emptyset$ if, and only if, $i$ is the root of $T$.

(b) If $W_i \neq \emptyset$, then its smallest element $w_i$ is the weight of the arc with head $i$.

Proof. As far as (a) goes, $W_i = \emptyset$ if, and only if, $\ell_T(i, j) = 0$ for every $j \in S$, that is,
if, and only if, $i$ is an ancestor of every labeled node. Since the set of labeled nodes of
$T$ includes all leaves and all elementary nodes, this is equivalent to the fact that $i$ is the
root.

As far as (b) goes, assume that $W_i \neq \emptyset$, so that $i$ has a parent $x$. Let $w_i$ be the
weight of the arc $(x, i)$. Then, since every non-trivial path $[i, j]_T \rightsquigarrow i$ must end with the
arc $(x, i)$, it is clear that if $\ell_T(i, j) > 0$, then $\ell_T(i, j) \geq w_i$.

Now, if $x$ is labeled, say with label $i_0$, then $x = [i, i_0]_T$ and thus $\ell_T(i, i_0) = w_i$. If $x$ is
not labeled, then it cannot be elementary, and hence it must have at least another child
$y$. Let $i_0$ be a descendant leaf of $y$. In this case, $x = [i, i_0]_T$ and $\ell_T(i, i_0) = w_i$, too. This
proves that, in all cases, $w_i \in W_i$, and thus that it is the smallest element of this set. □

The following result is a direct consequence of the last two lemmas.

Corollary 1. Let $T$ and $T'$ be two $\mathbb{R}_{>0}$-weighted phylogenetic trees on the same set $S$
of taxa such that $\ell(T) = \ell(T')$. Then:

(a) The nested taxa of $T$ and $T'$ are the same.

(b) $T$ has its root labeled with $i$ if, and only if, $T'$ has its root labeled with $i$.

(c) If the nodes labeled with $i$ in $T$ and $T'$ are not their roots, the weight of the arc with
head $i$ in $T$ and in $T'$ is the same. □
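
The information that Lemmas 1 and 2 extract from a splitted path lengths matrix can be read off directly, as in the following sketch. The function names are hypothetical, the matrix is stored as a list of rows with taxon $i$ in row $i - 1$, and exact comparison with 0 is assumed to be meaningful for the stored weights. The example matrix is meant as that of an assumed tree whose root is labeled 1 and has two leaf children, 2 (arc weight 1.5) and 3 (arc weight 2).

```python
def nested_taxa(M):
    """Lemma 1: i is a nested taxon iff some off-diagonal entry of row i is 0."""
    n = len(M)
    return {i + 1 for i in range(n)
            if any(M[i][j] == 0 for j in range(n) if j != i)}

def root_label(M):
    """Lemma 2(a): the root is labeled i iff row i has no positive entry."""
    for i, row in enumerate(M):
        if all(x == 0 for x in row):
            return i + 1
    return None  # the root is unlabeled

def incoming_arc_weight(M, i):
    """Lemma 2(b): weight of the arc with head i, or None if i is the root."""
    positive = [x for x in M[i - 1] if x > 0]
    return min(positive) if positive else None

# Assumed example: root labeled 1 with leaf children 2 (weight 1.5) and 3 (weight 2).
M = [[0, 0, 0],
     [1.5, 0, 1.5],
     [2, 2, 0]]
print(nested_taxa(M), root_label(M), incoming_arc_weight(M, 2))  # {1} 1 1.5
```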

Let $S$ be a set of taxa and $\mathcal{R}(S)$ the set of $S$-triples, that is, of structures $ab|c$ with
$a, b, c \in S$ pairwise different. Classically, an $S$-triple $ab|c$ is said to be present in a
phylogenetic tree $T$ if $c$ diverged from $a$ before $b$ did, in the sense that $[a, b]_T < [a, c]_T =
[b, c]_T$ (that is, $[a, c]_T = [b, c]_T$ is a non-trivial ancestor of $[a, b]_T$). Let now $(T, \omega)$ be an
$\mathbb{R}_{\geq 0}$-weighted phylogenetic tree on $S$. For every $ab|c \in \mathcal{R}(S)$,
let $\lambda_T(ab|c) \in \mathbb{R}_{\geq 0}$ be defined as follows:

- If $ab|c$ is present in $T$, then $\lambda_T(ab|c)$ is the distance from $[a, c]_T = [b, c]_T$ to $[a, b]_T$.

- If $ab|c$ is not present in $T$, then $\lambda_T(ab|c) = 0$.

Notice that $\lambda_T(ab|c) = \lambda_T(ba|c)$.

This mapping $\lambda_T$ has a simple description in terms of $\ell(T)$.

Lemma 3. Let $(T, \omega)$ be an $\mathbb{R}_{\geq 0}$-weighted phylogenetic tree on $S$. For every $ab|c \in \mathcal{R}(S)$,

$$\lambda_T(ab|c) = \max\{\ell_T(a, c) - \ell_T(a, b),\ 0\}.$$

Proof. If $[a, c]_T$ is a non-trivial ancestor of $[a, b]_T$ in $T$, then the path $[a, c]_T \rightsquigarrow a$ contains
the node $[a, b]_T$ and the distance $\ell_T(a, c)$ from $[a, c]_T$ to $a$ is equal to the distance $\lambda_T(ab|c)$
from $[a, c]_T$ to $[a, b]_T$ plus the distance $\ell_T(a, b)$ from $[a, b]_T$ to $a$. Therefore, in this case,

$$\max\{\ell_T(a, c) - \ell_T(a, b),\ 0\} = \ell_T(a, c) - \ell_T(a, b) = \lambda_T(ab|c).$$

If $[a, c]_T = [a, b]_T$, then $\ell_T(a, c) = \ell_T(a, b)$ and $ab|c$ is not present in $T$ and thus

$$\max\{\ell_T(a, c) - \ell_T(a, b),\ 0\} = 0 = \lambda_T(ab|c).$$

Finally, if $[a, c]_T$ is not an ancestor of $[a, b]_T$, then it must happen that $[a, b]_T$ is a
non-trivial ancestor of $[a, c]_T$ and therefore $\ell_T(a, b) \geq \ell_T(a, c)$. Since $ab|c$ is not present in $T$,
either, this implies that

$$\max\{\ell_T(a, c) - \ell_T(a, b),\ 0\} = 0 = \lambda_T(ab|c).$$

So, the equality in the statement always holds. □
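
The formula of Lemma 3 translates into a one-line function on the splitted path lengths matrix. In this sketch the name `triple_weight` and the example matrix are assumptions; the matrix is meant as that of the hypothetical non-weighted tree ((1,2),3), stored as a list of rows with taxon $i$ in row $i - 1$.

```python
def triple_weight(M, a, b, c):
    """lambda(ab|c) = max{l(a, c) - l(a, b), 0}, with 1-based taxa a, b, c."""
    return max(M[a - 1][c - 1] - M[a - 1][b - 1], 0)

# Assumed splitted path lengths matrix of ((1,2),3) with unit arc weights.
M = [[0, 1, 2],
     [1, 0, 2],
     [1, 1, 0]]
print(triple_weight(M, 1, 2, 3))  # 1: the triple 12|3 is present
print(triple_weight(M, 1, 3, 2))  # 0: the triple 13|2 is not present
```

The value 1 is the length of the internal arc joining the root to the parent of taxa 1 and 2, and swapping the first two arguments gives the same result, in line with the symmetry noted before the lemma.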

The following result is Thm. 2 in [14]. In it, $\mathcal{Q}(X)$ denotes the set of $X$-quartets, that
is, of structures $ab|cd$ with $a, b, c, d \in X$ pairwise different.

Theorem 1. Let $\lambda : \mathcal{R}(S) \to \mathbb{R}_{\geq 0}$ be a map such that $\lambda(ab|c) = \lambda(ba|c)$ for every
$a, b, c \in S$ pairwise different, and let $z$ be an element not in $S$. Then:

(a) $\lambda = \lambda_T$ for some $\mathbb{R}_{\geq 0}$-weighted phylogenetic tree $(T, \omega)$ with neither nested taxa nor
weight 0 internal arcs if, and only if, the mapping $\mu : \mathcal{Q}(S \cup \{z\}) \to \mathbb{R}_{\geq 0}$ defined by

$$\mu(ab|cd) = \begin{cases}
\lambda(ab|c) & \text{if } d = z \\
\min\{\lambda(ab|c), \lambda(ab|d)\} + \min\{\lambda(cd|a), \lambda(cd|b)\} & \text{if } d \neq z
\end{cases}$$

satisfies the following properties:

(1) $\mu(ab|cd) = \mu(ba|cd) = \mu(cd|ab)$.

(2) For every $a, b, c, d$, at least two of $\mu(ab|cd)$, $\mu(ac|bd)$, and $\mu(ad|bc)$ are equal to 0.

(3) If $\mu(ab|cd) > 0$, then, for every $x \neq a, b, c, d$, either $\mu(ab|cx) \cdot \mu(ab|dx) > 0$ or
$\mu(ax|cd) \cdot \mu(bx|cd) > 0$.

(4) For every $a, b, c, d, e$, if $\mu(ab|cd) > \mu(ab|ce) > 0$, then

$$\mu(ae|cd) = \mu(ab|cd) - \mu(ab|ce).$$

(5) For every $a, b, c, d, e$, if $\mu(ab|cd) > 0$ and $\mu(bc|de) > 0$, then

$$\mu(ab|de) = \mu(ab|cd) + \mu(bc|de).$$

(b) If $(T, \omega)$ and $(T', \omega')$ are two $\mathbb{R}_{\geq 0}$-weighted phylogenetic trees with neither nested
taxa nor weight 0 internal arcs and such that $\lambda_T = \lambda_{T'}$, then $T \cong T'$ as phylogenetic
trees and the isomorphism preserves the weights of the internal arcs. □

Now we can proceed with the proof that splitted path lengths matrices characterize
$\mathbb{R}_{>0}$-weighted phylogenetic trees.

Theorem 2. Two $\mathbb{R}_{>0}$-weighted phylogenetic trees on the same set $S$ of taxa are
isomorphic if, and only if, they have the same splitted path lengths matrices.

Proof. As in Proposition 1, the statement when $|S| = 1$ is obviously true. Assume now
that $|S| \geq 2$. For every $\mathbb{R}_{>0}$-weighted phylogenetic tree $(T, \omega)$ on $S$, let $(\bar{T}, \bar{\omega})$ be the
$\mathbb{R}_{\geq 0}$-weighted phylogenetic tree without nested taxa obtained as follows: for every internal
labeled node $i$ of $T$, unlabel it and add to it a leaf child labeled with $i$ through an arc
of weight 0. It is straightforward to check that $\ell_T(i, j) = \ell_{\bar{T}}(i, j)$ for every $i, j \in S$. Since
$T$ was $\mathbb{R}_{>0}$-weighted, the only weight 0 arcs in $\bar{T}$ are the new pendant arcs that replace
the nested taxa. Moreover, $(T, \omega)$ can be recovered from $(\bar{T}, \bar{\omega})$ by simply removing the
weight 0 pendant arcs and labeling the tail of a removed arc with the label of the arc's
head.
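
The passage from a tree to its version without nested taxa can be sketched as follows, together with a check that the splitted path lengths are unchanged on one small example. The encoding (a dictionary `parent` from each non-root node to `(parent, arc weight)`, taxa used as node names), the fresh node names `("inner", i)` and the function names are assumptions made for this illustration.

```python
def ancestor_dists(parent, v):
    """Map every ancestor of v (v included) to its distance down to v."""
    out, d = {v: 0}, 0
    while v in parent:
        v, w = parent[v]
        d += w
        out[v] = d
    return out

def splitted_matrix(parent, n):
    """Matrix whose (i, j) entry is the distance from the LCA of i and j to i."""
    anc = {i: ancestor_dists(parent, i) for i in range(1, n + 1)}
    return [[anc[i][next(v for v in anc[i] if v in anc[j])]
             for j in range(1, n + 1)] for i in range(1, n + 1)]

def push_nested_taxa_down(parent, taxa):
    """Unlabel each internal labeled node and hang its label on a new
    leaf child attached through an arc of weight 0."""
    internal = {p for p, _ in parent.values()}
    nested = sorted(t for t in taxa if t in internal)
    rename = {i: ("inner", i) for i in nested}   # fresh unlabeled node names
    new_parent = {rename.get(v, v): (rename.get(p, p), w)
                  for v, (p, w) in parent.items()}
    for i in nested:
        new_parent[i] = (rename[i], 0)           # new weight 0 pendant arc
    return new_parent

# Assumed example: root labeled 1 with leaf children 2 (weight 1.5) and 3 (weight 2).
tree = {2: (1, 1.5), 3: (1, 2)}
barred = push_nested_taxa_down(tree, taxa={1, 2, 3})
print(splitted_matrix(tree, 3) == splitted_matrix(barred, 3))  # True
```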

Let now $(T_1, \omega_1)$ and $(T_2, \omega_2)$ be two $\mathbb{R}_{>0}$-weighted phylogenetic trees on the same
set $S$ of taxa such that $\ell(T_1) = \ell(T_2)$. Then $\ell(\bar{T}_1) = \ell(\bar{T}_2)$ and hence, by Lemma 3,
$\lambda_{\bar{T}_1} = \lambda_{\bar{T}_2}$. Since $(\bar{T}_1, \bar{\omega}_1)$ and $(\bar{T}_2, \bar{\omega}_2)$ are $\mathbb{R}_{\geq 0}$-weighted phylogenetic trees with neither
nested taxa nor weight 0 internal arcs, by Theorem 1.(b) we have that $\bar{T}_1 \cong \bar{T}_2$ as
phylogenetic trees, and moreover this isomorphism preserves the weights of the internal
arcs. But we also know that the arc ending in the leaf $i$ has the same weight in $\bar{T}_1$ and
in $\bar{T}_2$: if $i$ was a nested taxon of $T_1$ and $T_2$ (and recall that $T_1$ and $T_2$ have the same
nested taxa by Corollary 1.(a)), this weight is in both cases 0, and if $i$ was the label of a
leaf of $T_1$ and $T_2$, this weight is the same in $T_1$ and in $T_2$ by Corollary 1.(c), and hence
in $\bar{T}_1$ and in $\bar{T}_2$.

Therefore, the isomorphism $\bar{T}_1 \cong \bar{T}_2$ is an isomorphism of weighted phylogenetic
trees. Finally, the way $(T_1, \omega_1)$ and $(T_2, \omega_2)$ are reconstructed from $(\bar{T}_1, \bar{\omega}_1)$ and $(\bar{T}_2, \bar{\omega}_2)$
implies that this isomorphism induces an isomorphism of weighted phylogenetic trees
$T_1 \cong T_2$.

This proves the 'if' implication; the 'only if' implication is obvious. □

Remark 2. The proof of the last theorem can also be applied, with small modifications,
to prove that the splitted path lengths matrices also separate $\mathbb{R}_{>0}$-weighted phylogenetic
trees with multi-labeled nodes, that is, where a node can have more than one label (but
two different nodes cannot share any label); in such a tree $T$, if $i$ and $j$ are labels of
the same node, then $\ell_T(i, j) = \ell_T(j, i) = 0$. It is enough to slightly change the definition
of $\bar{T}$: on the one hand, for every internal labeled node of $T$, unlabel it and, for each
one of its labels, add to it a leaf child labeled with this label through an arc of weight
0; and, on the other hand, do the same for every leaf with more than one label. The
same argument as in the proof of the last theorem shows that if $T_1$ and $T_2$ are two
$\mathbb{R}_{>0}$-weighted phylogenetic trees with multi-labeled nodes such that $\ell(T_1) = \ell(T_2)$, then
the $\mathbb{R}_{\geq 0}$-weighted phylogenetic trees with neither nested taxa nor weight 0 internal arcs
$\bar{T}_1$ and $\bar{T}_2$ obtained in this way are isomorphic. To derive from this isomorphism an
isomorphism $T_1 \cong T_2$, one must use that, in this multi-labeled case:

- An internal node of a tree $T$ is labeled $\{i_1, \ldots, i_k\}$ if, and only if, $\ell_T(a, b) = 0$ for
every $a, b \in \{i_1, \ldots, i_k\}$, $\ell_T(a, j) > 0$ or $\ell_T(j, a) > 0$ for every $a \in \{i_1, \ldots, i_k\}$ and
every $j \notin \{i_1, \ldots, i_k\}$, and there exists some $j \notin \{i_1, \ldots, i_k\}$ such that $\ell_T(a, j) = 0$
for every $a \in \{i_1, \ldots, i_k\}$.

- A leaf of $T$ is labeled $\{i_1, \ldots, i_k\}$ if, and only if, $\ell_T(a, b) = 0$ for every $a, b \in
\{i_1, \ldots, i_k\}$, and $\ell_T(a, j) > 0$ for every $a \in \{i_1, \ldots, i_k\}$ and every $j \notin \{i_1, \ldots, i_k\}$.

These properties entail that if $\ell(T_1) = \ell(T_2)$, then $T_1$ and $T_2$ have the same families of
sets $\{i_1, \ldots, i_k\}$ of labels of internal nodes as well as of leaves. We leave the details to
the reader.

Notice that Theorem 1 not only establishes that the mapping $\lambda_T$ singles out an
$\mathbb{R}_{\geq 0}$-weighted phylogenetic tree $T$ with neither nested taxa nor weight 0 internal arcs, up to
the weights of its pendant arcs, but it also characterizes what mappings can be realized
as $\lambda_T$-mappings, for some $T$ of this type. We can use this result to characterize the
matrices that are splitted path lengths matrices of $\mathbb{R}_{>0}$-weighted phylogenetic trees.

Proposition 2. Let $M = (m_{i,j}) \in \mathcal{M}_n(\mathbb{R}_{\geq 0})$ be an $n \times n$ square matrix over $\mathbb{R}_{\geq 0}$ with
$m_{i,i} = 0$ for every $i = 1, \ldots, n$. Then, $M = \ell(T)$ for some $\mathbb{R}_{>0}$-weighted phylogenetic
tree $T$ on $S = \{1, \ldots, n\}$ if, and only if, the mapping $\lambda_M : \mathcal{R}(S) \to \mathbb{R}_{\geq 0}$ defined by

$$\lambda_M(ab|c) = \max\{m_{a,c} - m_{a,b},\ 0\}$$

satisfies the following conditions:

(a) $\lambda_M(ab|c) = \lambda_M(ba|c)$ for every $a, b, c \in S$ pairwise different.

(b) The mapping $\mu_M$ defined from $\lambda_M$ as in Theorem 1.(a) satisfies properties (1)-(5)
therein.

Proof. The 'only if implication is easy: if M = liT), so that niij = iT{i,j) for every 
i,j G S, then Am = A^, with T the M^o-weighted phylogenetic tree without nested taxa 
or weight internal arcs associated to T in the proof of Theorem 2, and therefore it 
satisfies conditions (a) and (b) in the statement. 

Conversely, if Am satisfies conditions (a) and (b), then by Theorem 1 there exists an 
M^o-weighted phylogenetic tree Tq without nested taxa or weight internal arcs such 
that Am = Xtq- By Lemma 3, ATg(o6|c) = maxj^T-^ (a, c) — £To{0',b),0}. Therefore, for 
every a,b,c E S pairwise different, 

max{^To (a, c) - Itq {a, b),0} = max{ ma,c - 'rna,b, 0}. 

The tree Tq is unique up to the weights of the pendant arcs. So, without any loss of 
generality we may assume that the weight of the arc ending in the leaf a is 

min{ma,j \ j ^ a}. 

Now, for every a e S and for every b e S \ {a}, 6 is a descendant of the parent Xa 

of a in Tq if, and only if, m^^fe = inm{maj \ j a}- As far as the 'if implication goes, 
assume that 5 = minjmaj | J 7^ a} but b is not a descendant of x^. Let c G S" \ {a} 
be a descendant of Xa, so that [a, c]ro = Xa- Then, [a, c]to is a non-trivial descendant of 
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[a, b]To and therefore (since the internal arcs of Tq have non-negative weight), ^To(fl) b) — 
£To(a)C) > 0. But this contradicts the fact that, since ma,c ^ ^a,bj 

iTo{a,b) - £To{a,c) = XTo{ac\b) = XM{ac\b) = min{ma,b - ma,c,0} = 0. 

As far as the converse impUcation goes, let 6 G 5 \ {a} be a descendant of Xa, and let 
b' & S \ {a} be such that nia^b' = niin{maj | j ^ a}: as we have just seen, b' is also a 
descendant of Xa and therefore [a, ^]t() = [^^j ^'Ito = a^a- Then, max{ma^b ~ iTT-afi'^O} = 
XTo{ab'\b) = implies that ma,b — fna,b' ^ Oj that is, that nia^b = min{maj | j / a}, too. 

Now, let us fix a taxon $a \in S$, and let $b \in S \setminus \{a\}$ be a descendant of the parent
$x_a$ of $a$ in $T_0$. Then, on the one hand, $\ell_{T_0}(a,b) = m_{a,b}$, because it is the weight of the
arc $(x_a, a)$, and, on the other hand, for every $c \neq a,b$, we have that $m_{a,c} \geq m_{a,b}$ and
$\ell_{T_0}(a,c) \geq \ell_{T_0}(a,b)$ and therefore

$$m_{a,c} = \lambda_M(ab|c) + m_{a,b} = \lambda_{T_0}(ab|c) + \ell_{T_0}(a,b) = \ell_{T_0}(a,c).$$

This implies that the $a$-th rows of $M$ and $\ell(T_0)$ are equal, and hence, since $a$ was any
element of $S$, $M = \ell(T_0)$.

Finally, $T_0$ is transformed into an $\mathbb{R}_{>0}$-weighted phylogenetic tree with the same
splitted path lengths matrix by simply removing the weight 0 pendant arcs and labeling
the tail of a removed arc with the label of the arc's head; cf. the proof of Theorem 2. □

5 Splitted nodal metrics 

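Let $\mathcal{T}_n$ be the space of $\mathbb{R}_{>0}$-weighted phylogenetic trees on the set $S = \{1, \ldots, n\}$ of taxa.
As we have seen, the mapping

$$\ell : \mathcal{T}_n \to \mathcal{M}_n(\mathbb{R}_{\geq 0})$$

that associates to each $(T, \omega) \in \mathcal{T}_n$ its splitted path lengths matrix $\ell(T)$ is injective up
to isomorphisms. As it happened with the embedding $L : \mathcal{BT}_n \to \mathbb{R}^{n(n-1)/2}$, this allows
one to induce metrics on $\mathcal{T}_n$ from metrics on $\mathcal{M}_n(\mathbb{R}_{\geq 0})$.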

Proposition 3. Let $D$ be any metric on $\mathcal{M}_n(\mathbb{R}_{\geq 0})$. The mapping

$$d:\ (T_1,T_2) \mapsto D(\ell(T_1),\ell(T_2))$$

satisfies the axioms of metrics up to isomorphisms:

(1) $d(T_1,T_2) \geq 0$,

(2) $d(T_1,T_2) = 0$ if, and only if, $T_1 \cong T_2$,

(3) $d(T_1,T_2) = d(T_2,T_1)$,

(4) $d(T_1,T_3) \leq d(T_1,T_2) + d(T_2,T_3)$.



Proof. Properties (1), (3) and (4) are direct consequences of the corresponding properties
of $D$, while property (2) follows from the separation axiom for $D$ (which says that
$D(M_1, M_2) = 0$ if, and only if, $M_1 = M_2$) and Theorem 2. □

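We shall generically call splitted nodal metrics the metrics on $\mathcal{T}_n$ induced by metrics
on $\mathcal{M}_n(\mathbb{R}_{\geq 0})$ through the embedding $\ell$. In particular, every $L^p$ norm $\|\cdot\|_p$ on $\mathcal{M}_n(\mathbb{R}_{\geq 0})$
defines a splitted nodal metric $d^\ell_p$ through

$$d^\ell_p(T_1,T_2) = \|\ell(T_1) - \ell(T_2)\|_p.$$

For instance,

$$d^\ell_1(T_1,T_2) = \sum_{1 \leq i \neq j \leq n} |\ell_{T_1}(i,j) - \ell_{T_2}(i,j)|, \qquad d^\ell_2(T_1,T_2) = \sqrt{\sum_{1 \leq i \neq j \leq n} (\ell_{T_1}(i,j) - \ell_{T_2}(i,j))^2}$$

are the splitted nodal metrics induced by the $L^1$ and $L^2$ norms on $\mathcal{M}_n(\mathbb{R}_{\geq 0})$.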
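The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `splitted_nodal_distance` is hypothetical, the two splitted path lengths matrices are assumed to be already available as nested lists, and the cases $p = 0$ and $p = \infty$ follow the conventions used later in the paper (number of differing entries and largest absolute difference, respectively).

```python
import math

def splitted_nodal_distance(m1, m2, p):
    """Distance between two n x n matrices under the L^p 'norm' (p = 0, 1, 2, ..., or math.inf)."""
    # Absolute entrywise differences, flattened row by row.
    diffs = [abs(x - y) for row1, row2 in zip(m1, m2) for x, y in zip(row1, row2)]
    if p == 0:
        return sum(1 for d in diffs if d != 0)   # number of entries that differ
    if p == math.inf:
        return max(diffs)                        # largest absolute difference
    return sum(d ** p for d in diffs) ** (1.0 / p)

# Hypothetical 2 x 2 inputs, chosen only to exercise the function.
A = [[0, 1], [0, 0]]
B = [[0, 0], [1, 0]]
print(splitted_nodal_distance(A, B, 0),
      splitted_nodal_distance(A, B, 1),
      splitted_nodal_distance(A, B, math.inf))
```

Keeping the three regimes in one function mirrors the way the family $d^\ell_p$ is indexed by $p \in \mathbb{N} \cup \{\infty\}$ in the text.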

We have seen in the previous section that the splitted path lengths matrices can be
computed in $O(n^2)$ time. Their difference can be computed in $O(n^2)$ time, and the sum of
the $p$-th powers of the entries of the resulting matrix can be computed in $O(n^2 \log(p) + n^2)$
time (assuming constant-time addition and multiplication of real numbers). Therefore,
the cost of computing $d^\ell_p(T_1,T_2)^p$, for $T_1,T_2 \in \mathcal{T}_n$ and $p \in \mathbb{N}^+$, is $O(n^2 \log(p) + n^2)$. Thus,
if $p = 1$, the $d^\ell_1$ metric on $\mathcal{T}_n$ can be computed in $O(n^2)$ time. For $p \geq 2$, the cost of
computing $d^\ell_p(T_1,T_2)$, for $T_1,T_2 \in \mathcal{T}_n$, as the $p$-th root of $d^\ell_p(T_1,T_2)^p$ will depend on the
accuracy with which this root is computed. For instance, using the Newton method to
compute it with an accuracy of a $1/2^h$-th of its value has a cost of $O(p^2 \log(p) \log(hp))$;
see, for instance, [4]. So, in practice, for small $p$ and not too large $h$, this step will
be dominated by the computation of $d^\ell_p(T_1,T_2)^p$, and the total cost will be $O(n^2)$ (we
understand in this case $\log(p)$ as part of the constant factor). For $p = 0$ or $\infty$, the cost
of computing $d^\ell_p(T_1,T_2)$ is also $O(n^2)$.
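The root-extraction step can be illustrated with integer arithmetic. The sketch below is not the procedure of [4]; it is the standard integer Newton iteration for the floor of a $p$-th root, applied to a scaled argument so that the result carries $h$ fractional bits. Note the assumption: this yields an absolute error below $2^{-h}$, a slightly different criterion from the relative accuracy mentioned above. The names `integer_root` and `fixed_point_root` are hypothetical.

```python
def integer_root(k, n):
    """Largest integer x with x**k <= n (n >= 0, k >= 1), by Newton's iteration on integers."""
    if n < 2:
        return n
    # Start from a power of two that is at least the true root.
    x = 1 << -(-n.bit_length() // k)
    while True:
        y = ((k - 1) * x + n // x ** (k - 1)) // k
        if y >= x:          # the iteration stopped decreasing: x is the floor root
            return x
        x = y

def fixed_point_root(p, s, h):
    """Approximate s**(1/p) from below with absolute error < 2**-h."""
    return integer_root(p, s << (h * p)) / 2 ** h

# For non-weighted trees d_p^p is an integer, e.g. a (hypothetical) power sum 22 with p = 2.
print(integer_root(2, 22), fixed_point_root(2, 22, 20))
```

Working with the integer power sum and taking the root only at the end keeps the $O(n^2)$ part exact, which is the point made in the paragraph above for small $p$.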

These splitted nodal metrics can be seen conceptually as the generalizations to $\mathcal{T}_n$
of the classical nodal metrics on $\mathcal{BT}_n$. Conceptually, but not numerically, because the
restriction of $d^\ell_p$ to $\mathcal{BT}_n$ is not equal to $d_p$, even up to a scalar factor, as the following
easy example shows.



Example 2. Consider the non-weighted binary trees $T_1, T_2, T_3$ depicted in Fig. 4. It is
easy to compute their path lengths vectors and splitted path lengths matrices:

$$L(T_1) = (3,4,4,3,3,2), \quad L(T_2) = (2,3,4,3,4,3), \quad L(T_3) = (4,4,3,2,3,3)$$

$$\ell(T_1) = \begin{pmatrix} 0&1&1&1\\ 2&0&1&1\\ 3&2&0&1\\ 3&2&1&0 \end{pmatrix}, \quad
\ell(T_2) = \begin{pmatrix} 0&1&2&3\\ 1&0&2&3\\ 1&1&0&2\\ 1&1&1&0 \end{pmatrix}, \quad
\ell(T_3) = \begin{pmatrix} 0&1&1&1\\ 3&0&1&2\\ 3&1&0&2\\ 2&1&1&0 \end{pmatrix}$$

Fig. 4. The non-weighted binary phylogenetic trees in Example 2.



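From these vectors and matrices we obtain that

$$d_p(T_1,T_2) = d_p(T_1,T_3) = \begin{cases} 4 & \text{if } p = 0\\ \sqrt[p]{4} & \text{if } p \in \mathbb{N}^+\\ 1 & \text{if } p = \infty \end{cases}$$

while

$$d^\ell_p(T_1,T_2) = \begin{cases} 10 & \text{if } p = 0\\ \sqrt[p]{6 + 4 \cdot 2^p} & \text{if } p \in \mathbb{N}^+\\ 2 & \text{if } p = \infty \end{cases}
\qquad
d^\ell_p(T_1,T_3) = \begin{cases} 6 & \text{if } p = 0\\ \sqrt[p]{6} & \text{if } p \in \mathbb{N}^+\\ 1 & \text{if } p = \infty \end{cases}$$

This shows that there does not exist any $\lambda \in \mathbb{R}$ such that $d^\ell_p = \lambda \cdot d_p$ on $\mathcal{BT}_4$ for any
$p \in \mathbb{N} \cup \{\infty\}$. Similar counterexamples can be produced for every $n \geq 4$.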
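The figures in Example 2 can be re-derived mechanically. The sketch below rests on assumptions that are not in the paper: trees are encoded as child-to-parent dictionaries, the internal node names `"r"`, `"u"`, `"v"` are arbitrary, and the three shapes (written in Newick-like form as (1,(2,(3,4))), (((1,2),3),4) and (1,((2,3),4))) are inferred from the path lengths vectors, since Fig. 4 is not reproduced here. The helper `splitted_matrix` is a plain quadratic-times-depth computation, not the $O(n^2)$ algorithm referred to in the text.

```python
def ancestors(parent, v):
    """Return [v, parent(v), ..., root]."""
    chain = [v]
    while parent[chain[-1]] is not None:
        chain.append(parent[chain[-1]])
    return chain

def splitted_matrix(parent, taxa):
    """Entry (i, j): number of arcs from the least common ancestor of i and j down to i."""
    n = len(taxa)
    ell = [[0] * n for _ in range(n)]
    for r, a in enumerate(taxa):
        up_a = ancestors(parent, a)
        for c, b in enumerate(taxa):
            if a != b:
                up_b = set(ancestors(parent, b))
                # The first node on the way up from a that is also above b is the LCA.
                ell[r][c] = next(k for k, v in enumerate(up_a) if v in up_b)
    return ell

def path_vector(ell):
    """Classical path lengths vector: ell(i,j) + ell(j,i) for i < j."""
    n = len(ell)
    return [ell[i][j] + ell[j][i] for i in range(n) for j in range(i + 1, n)]

def flatten(m):
    return [e for row in m for e in row]

def power_sum(x, y, p):
    """Sum of p-th powers of absolute differences, i.e. the p-th power of the L^p distance."""
    return sum(abs(s - t) ** p for s, t in zip(x, y))

TAXA = [1, 2, 3, 4]
T1 = {"r": None, 1: "r", "u": "r", 2: "u", "v": "u", 3: "v", 4: "v"}
T2 = {"r": None, 4: "r", "u": "r", 3: "u", "v": "u", 1: "v", 2: "v"}
T3 = {"r": None, 1: "r", "u": "r", 4: "u", "v": "u", 2: "v", 3: "v"}
E1, E2, E3 = (splitted_matrix(t, TAXA) for t in (T1, T2, T3))

for p in (1, 2, 3):
    print(p,
          power_sum(path_vector(E1), path_vector(E2), p),   # d_p(T1,T2)^p
          power_sum(flatten(E1), flatten(E2), p),           # splitted version, T1 vs T2
          power_sum(flatten(E1), flatten(E3), p))           # splitted version, T1 vs T3
```

For each $p$ the three printed values should agree with $4$, $6 + 4 \cdot 2^p$ and $6$, the quantities under the $p$-th roots in the example.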

The following inequality relates $d_p$ and $d^\ell_p$ on any $\mathcal{BT}_n$.

Proposition 4. For every $T_1,T_2 \in \mathcal{BT}_n$ and for every $p \in \mathbb{N} \cup \{\infty\}$,

$$d_p(T_1,T_2) \leq \begin{cases} d^\ell_0(T_1,T_2) & \text{if } p = 0\\ 2^{1-1/p}\, d^\ell_p(T_1,T_2) & \text{if } p \in \mathbb{N}^+\\ 2\, d^\ell_\infty(T_1,T_2) & \text{if } p = \infty \end{cases}$$

Proof. For every $T \in \mathcal{BT}_n$, let $L^*(T)$ be the symmetric matrix

$$L^*(T) = \ell(T) + \ell(T)^t.$$

Notice that the $(i,j)$-th and the $(j,i)$-th entries of $L^*(T)$ are both equal to $L_T(i,j)$. Now,
by the usual properties of norms,

$$\begin{aligned}
\|L^*(T_1) - L^*(T_2)\|_p &= \|\ell(T_1) + \ell(T_1)^t - (\ell(T_2) + \ell(T_2)^t)\|_p\\
&\leq \|\ell(T_1) - \ell(T_2)\|_p + \|\ell(T_1)^t - \ell(T_2)^t\|_p\\
&= 2\|\ell(T_1) - \ell(T_2)\|_p.
\end{aligned}$$

On the other hand, $L^*(T_1) - L^*(T_2)$ can be understood as two concatenated copies of
$L(T_1) - L(T_2)$ and therefore,

$$\|L^*(T_1) - L^*(T_2)\|_p = \begin{cases} 2\|L(T_1) - L(T_2)\|_p & \text{if } p = 0\\ \sqrt[p]{2} \cdot \|L(T_1) - L(T_2)\|_p & \text{if } p \in \mathbb{N}^+\\ \|L(T_1) - L(T_2)\|_p & \text{if } p = \infty \end{cases}$$

Combining this equality with the previous inequality we obtain the inequality in the
statement. □
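As a sanity check, the inequality can be evaluated on the trees $T_1, T_2$ of Example 2, using the distances computed there. The factor for $p \in \mathbb{N}^+$ is read here as $2^{1-1/p}$, which is the value obtained by dividing the bound $2\,d^\ell_p$ by $\sqrt[p]{2}$ in the proof; the numerical approximation is ours.

```latex
\begin{align*}
p = 0:      &\quad d_0 = 4 \;\le\; d^\ell_0 = 10,\\
p = 1:      &\quad d_1 = 4 \;\le\; 2^{0}\, d^\ell_1 = 6 + 4\cdot 2 = 14,\\
p = 2:      &\quad d_2 = \sqrt{4} = 2 \;\le\; 2^{1/2}\, d^\ell_2 = \sqrt{2}\,\sqrt{22} = \sqrt{44} \approx 6.63,\\
p = \infty: &\quad d_\infty = 1 \;\le\; 2\, d^\ell_\infty = 4.
\end{align*}
```

In all four cases the bound holds with room to spare, so this example does not indicate whether the constants are sharp.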



6 The non-weighted case 

Although weights enrich the topological structure of a phylogenetic tree, for instance by
adding probabilities, bootstrap values or divergence degrees to branches, the comparison
of non-weighted phylogenetic trees, as bare hierarchical classifications or evolutionary histories, has an interest in itself. Let $\mathcal{NT}_n$ denote the class of all non-weighted phylogenetic
trees on $S = \{1, \ldots, n\}$. Felsenstein [12] gave a recurrence formula for the number $U(n,m)$
of different trees in $\mathcal{NT}_n$ with $m$ unlabeled internal nodes, from which the total number
$|\mathcal{NT}_n|$ of different non-weighted phylogenetic trees on $n$ taxa can be computed: see Table
2 in [12] or sequence A005264 in [27]. Table 1 recalls the first values of $|\mathcal{NT}_n|$.



n                    1    2    3     4      5       6         7
$|\mathcal{NT}_n|$   1    3    22    262    4336    91984     2381408

Table 1. The values of $|\mathcal{NT}_n|$ for $n$ up to 7



In this section we gather some results on the splitted nodal metrics $d^\ell_p$, for $p \in \mathbb{N}^+$,
on $\mathcal{NT}_n$, and we report on some numerical experiments for $d^\ell_1$ and $d^\ell_2$ on this class. To
simplify the notations, for every $a,b \in S$ and $p \in \mathbb{N}^+$, we shall write $C^p_{T_1,T_2}(a,b)$ to denote
$|\ell_{T_1}(a,b) - \ell_{T_2}(a,b)|^p$. In this way, if $T_1,T_2 \in \mathcal{NT}_n$ and $p \in \mathbb{N}^+$, then

$$d^\ell_p(T_1,T_2)^p = \sum_{(a,b) \in S^2} C^p_{T_1,T_2}(a,b) \in \mathbb{N}.$$

Our first result shows that the metrics $d^\ell_p$ have a redundant factor on $\mathcal{NT}_n$ when $n$
is odd.

Lemma 4. If $n$ is odd, then $\|\ell(T)\|_1$ is even, for every $T \in \mathcal{NT}_n$.

Proof. Let $T = (E, V)$ be a non-weighted phylogenetic tree on $S = \{1, \ldots, n\}$ with $n$
odd. For every $e \in E$, let $\nu_\ell(e)$ be the number of paths $[i,j]_T \rightsquigarrow i$, with $i,j \in S$, that
contain the arc $e$. It is clear that

$$\|\ell(T)\|_1 = \sum_{1 \leq i \neq j \leq n} \ell_T(i,j) = \sum_{e \in E} \nu_\ell(e).$$

It turns out that if $n$ is odd, then every $\nu_\ell(e)$ is even and therefore the right-hand side
sum is even. Indeed, let $e = (u,v)$ be any arc and let $V_e$ be the set of descendant labeled
nodes of $v$. Then, $e$ is contained in a path $[i,j]_T \rightsquigarrow i$ if, and only if, $i \in V_e$ and $j \notin V_e$. This
shows that $\nu_\ell(e) = |V_e| \cdot |S - V_e|$. Now, since $|S|$ is odd, either $|V_e|$ or $|S - V_e|$ is even,
which implies that $\nu_\ell(e)$ is even. □
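The counting argument in this proof is easy to check on a small instance. The following sketch makes several assumptions of its own: the tree is a child-to-parent dictionary, the five-taxon tree used (with taxon 3 nested below taxon 4) is a hypothetical example, and `arc_usage` and `splitted_matrix` are illustrative helper names. The set of labeled descendants of the head of an arc is taken to include the head itself when it is labeled.

```python
def ancestors(parent, v):
    """Return [v, parent(v), ..., root]."""
    chain = [v]
    while parent[chain[-1]] is not None:
        chain.append(parent[chain[-1]])
    return chain

def splitted_matrix(parent, taxa):
    """Entry (i, j): number of arcs from the least common ancestor of i and j down to i."""
    n = len(taxa)
    ell = [[0] * n for _ in range(n)]
    for r, a in enumerate(taxa):
        up_a = ancestors(parent, a)
        for c, b in enumerate(taxa):
            if a != b:
                up_b = set(ancestors(parent, b))
                ell[r][c] = next(k for k, v in enumerate(up_a) if v in up_b)
    return ell

def arc_usage(parent, taxa):
    """For each arc (parent[v], v), return |V_e| * |S - V_e|, keyed by the head v."""
    below = {v: 0 for v in parent}
    for t in taxa:
        for v in ancestors(parent, t):   # t counts for itself and for every node above it
            below[v] += 1
    n = len(taxa)
    return {v: below[v] * (n - below[v]) for v in parent if parent[v] is not None}

# Hypothetical tree on 5 taxa: root r with children u, 4, 5; u has leaves 1, 2; 4 has child 3.
TAXA = [1, 2, 3, 4, 5]
T = {"r": None, "u": "r", 1: "u", 2: "u", 4: "r", 3: 4, 5: "r"}

usage = arc_usage(T, TAXA)
norm1 = sum(sum(row) for row in splitted_matrix(T, TAXA))
print(usage, norm1)
```

Since $n = 5$ is odd, every value in `usage` should be even, and their sum should coincide with the $L^1$ norm of the splitted path lengths matrix.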






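Proposition 5. If $n$ is odd, then $d^\ell_p(T_1,T_2)^p$ is even, for every $T_1,T_2 \in \mathcal{NT}_n$ and for
every $p \in \mathbb{N}^+$.

Proof. Let $T_1,T_2 \in \mathcal{NT}_n$, with $n$ odd. Then

$$d^\ell_p(T_1,T_2)^p = \sum_{1 \leq i \neq j \leq n} C^p_{T_1,T_2}(i,j).$$

Now, we know that $\sum_{1 \leq i \neq j \leq n} \ell_{T_1}(i,j)$ and $\sum_{1 \leq i \neq j \leq n} \ell_{T_2}(i,j)$ are even numbers. This
implies that the number

$$|\{(i,j) \in S^2 \mid C^p_{T_1,T_2}(i,j) \text{ odd}\}| = |\{(i,j) \in S^2 \mid \ell_{T_1}(i,j) - \ell_{T_2}(i,j) \text{ odd}\}|$$

is even, and hence that the sum $\sum_{1 \leq i \neq j \leq n} C^p_{T_1,T_2}(i,j)$ is even. □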

This result shows that if $n$ is odd, $d^\ell_1$ takes only even values on $\mathcal{NT}_n$, and therefore it
can be divided by 2 and the resulting values are still integer numbers. In a similar way, $d^\ell_2$
has a 'redundant' $\sqrt{2}$ factor on $\mathcal{NT}_n$, for $n$ odd. No similar result holds for even values
of $n$: for instance, $\mathcal{NT}_2$ consists of three trees $T_1,T_2,T_3$, with Newick strings (1,2);,
((1)2);, and ((2)1);, respectively, and $d^\ell_1(T_1,T_2) = d^\ell_1(T_1,T_3) = 1$, $d^\ell_1(T_2,T_3) = 2$.
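The three trees on two taxa make a convenient test case for nested taxa. In the sketch below the child-to-parent encoding and the helper names are assumptions of ours; in ((1)2); the labeled node 2 is the root and the parent of 1, and symmetrically for ((2)1);.

```python
def ancestors(parent, v):
    """Return [v, parent(v), ..., root]."""
    chain = [v]
    while parent[chain[-1]] is not None:
        chain.append(parent[chain[-1]])
    return chain

def splitted_matrix(parent, taxa):
    """Entry (i, j): number of arcs from the least common ancestor of i and j down to i."""
    n = len(taxa)
    ell = [[0] * n for _ in range(n)]
    for r, a in enumerate(taxa):
        up_a = ancestors(parent, a)
        for c, b in enumerate(taxa):
            if a != b:
                up_b = set(ancestors(parent, b))
                ell[r][c] = next(k for k, v in enumerate(up_a) if v in up_b)
    return ell

def d1(m1, m2):
    """Manhattan distance between two matrices."""
    return sum(abs(x - y) for r1, r2 in zip(m1, m2) for x, y in zip(r1, r2))

TAXA = [1, 2]
T1 = {"r": None, 1: "r", 2: "r"}   # (1,2);   unlabeled root with two leaves
T2 = {2: None, 1: 2}               # ((1)2);  taxon 2 is the parent of taxon 1
T3 = {1: None, 2: 1}               # ((2)1);  taxon 1 is the parent of taxon 2

M1, M2, M3 = (splitted_matrix(t, TAXA) for t in (T1, T2, T3))
print(d1(M1, M2), d1(M1, M3), d1(M2, M3))
```

The zero entry in the row of a nested taxon's ancestor is what lets the splitted matrix tell ((1)2); from ((2)1);, although both have path length 1 between the two taxa.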

Remark 3. The theses in the last two results are true in the more general setting of
$\mathbb{N}^+$-weighted phylogenetic trees. To see it, notice that if $(T, \omega)$ is such a tree, then

$$\|\ell(T)\|_1 = \sum_{e \in E} \omega(e) \cdot \nu_\ell(e),$$

and then, the proof that each $\nu_\ell(e)$ is even is the same as in the non-weighted case.

On the other hand, the thesis in the last proposition does not generalize to $p = 0$ or
$\infty$: it is easy to produce counterexamples showing that $d^\ell_0$ and $d^\ell_\infty$ can take odd values on
$\mathcal{NT}_n$ also when $n$ is odd.

Our next goal is to find the least non-zero value for $d^\ell_p$ on $\mathcal{NT}_n$, for $p \in \mathbb{N}^+$.

Lemma 5. Let $T_1,T_2 \in \mathcal{NT}_n$ with $n \geq 6$ and $p \in \mathbb{N}^+$. If there is some taxon that is a
leaf of largest depth in $T_1$ but not in $T_2$, then $d^\ell_p(T_1,T_2)^p \geq 5$.

Proof. To simplify the notations, and since in this proof the trees $T_1$, $T_2$ and the index $p$
are fixed, we shall write $C(a,b)$ to denote $C^p_{T_1,T_2}(a,b)$.

Assume, without any loss of generality, that 1 is a deepest leaf of $T_1$ and that 2
is a leaf of $T_2$ such that $\mathrm{depth}_{T_2}(2) > \mathrm{depth}_{T_2}(1)$. Then, the distance from $[1,2]_{T_2}$ to
2 will be larger than to 1. This implies that $\ell_{T_2}(2,1) > \ell_{T_2}(1,2)$. Since $\ell_{T_1}(2,1) \leq
\ell_{T_1}(1,2)$ (because $\mathrm{depth}_{T_1}(2) \leq \mathrm{depth}_{T_1}(1)$), it must happen that $\ell_{T_2}(2,1) \neq \ell_{T_1}(2,1)$ or
$\ell_{T_2}(1,2) \neq \ell_{T_1}(1,2)$, and therefore

$$C(1,2) + C(2,1) \geq 1.$$
Let us check now that, for every $a \in S \setminus \{1,2\}$, at least one of the following four equalities
does not hold:

$$\begin{array}{ll} \ell_{T_1}(1,a) = \ell_{T_2}(1,a), & \ell_{T_1}(2,a) = \ell_{T_2}(2,a),\\ \ell_{T_1}(a,1) = \ell_{T_2}(a,1), & \ell_{T_1}(a,2) = \ell_{T_2}(a,2) \end{array} \qquad (1)$$

This will imply that every $a \in S \setminus \{1,2\}$ contributes at least 1 to $d^\ell_p(T_1,T_2)^p$, in the sense that

$$C(1,a) + C(2,a) + C(a,1) + C(a,2) \geq 1.$$

Since there are at least 4 taxa in $S \setminus \{1,2\}$ and these contributions add up to $C(1,2) +
C(2,1)$, this will prove that $d^\ell_p(T_1,T_2)^p \geq 5$.

The way each $a \in S \setminus \{1,2\}$ contributes to $d^\ell_p(T_1,T_2)^p$ depends on its relative position
with respect to 1 and 2 in $T_2$.

- If $a \leq 1$ in $T_2$ (that is, if $a$ is a descendant of 1), then $\ell_{T_2}(1,a) = 0$ but $\ell_{T_1}(1,a) > 0$ and therefore $\ell_{T_2}(1,a) \neq \ell_{T_1}(1,a)$.

- Assume that $[a,1]_{T_2} = [a,2]_{T_2} \geq [1,2]_{T_2}$. In this case $\ell_{T_2}(a,2) = \ell_{T_2}(a,1)$ and
$\ell_{T_2}(2,a) > \ell_{T_2}(1,a)$. But these relations cannot hold in $T_1$, because they imply that
$\mathrm{depth}_{T_1}(2) > \mathrm{depth}_{T_1}(1)$. Thus, the equalities (1) cannot hold simultaneously.

- Assume that $1 < [a,1]_{T_2} < [1,2]_{T_2}$. In this case $\lambda_{T_2}(a1|2) > 0$ and

$$\begin{aligned} \ell_{T_2}(a,1) + \lambda_{T_2}(a1|2) &= \ell_{T_2}(a,2)\\ \ell_{T_2}(1,a) + \lambda_{T_2}(a1|2) &= \ell_{T_2}(1,2)\\ \ell_{T_2}(2,a) &= \ell_{T_2}(2,1) \end{aligned}$$

If $\ell_{T_1}(a,1) = \ell_{T_2}(a,1)$ and $\ell_{T_1}(a,2) = \ell_{T_2}(a,2)$, then the fact that $\ell_{T_1}(a,2) > \ell_{T_1}(a,1)$
implies that $1 < [a,1]_{T_1} < [1,2]_{T_1}$ and thus

$$\lambda_{T_1}(a1|2) = \ell_{T_1}(a,2) - \ell_{T_1}(a,1) = \ell_{T_2}(a,2) - \ell_{T_2}(a,1) = \lambda_{T_2}(a1|2).$$

Then, if $\ell_{T_1}(1,a) = \ell_{T_2}(1,a)$,

$$\ell_{T_1}(1,2) = \ell_{T_1}(1,a) + \lambda_{T_1}(a1|2) = \ell_{T_2}(1,a) + \lambda_{T_2}(a1|2) = \ell_{T_2}(1,2).$$

Finally, if $\ell_{T_1}(2,a) = \ell_{T_2}(2,a)$, then

$$\ell_{T_1}(2,1) = \ell_{T_1}(2,a) = \ell_{T_2}(2,a) = \ell_{T_2}(2,1).$$

And this leads to a contradiction, because, as we have seen at the beginning of the
proof, $\ell_{T_2}(2,1) \neq \ell_{T_1}(2,1)$ or $\ell_{T_2}(1,2) \neq \ell_{T_1}(1,2)$. Therefore, the equalities (1) cannot
hold simultaneously.

- If $2 < [a,2]_{T_2} < [1,2]_{T_2}$, a similar argument shows that at least one of the equalities
(1) fails, too.

This finishes the proof of the lemma. □
Fig. 5. Two non-isomorphic phylogenetic trees $T$, $T'$ in $\mathcal{NT}_n$ such that $d^\ell_p(T,T')^p = 4$ for every $p \in \mathbb{N}^+$.

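Theorem 3. For every $p \in \mathbb{N}^+$ and for every $n \geq 2$:

(1) If $n \leq 5$, then $\min\{d^\ell_p(T_1,T_2)^p \mid T_1,T_2 \in \mathcal{NT}_n,\ T_1 \neq T_2\} = n - 1$.

(2) If $n \geq 6$, then $\min\{d^\ell_p(T_1,T_2)^p \mid T_1,T_2 \in \mathcal{NT}_n,\ T_1 \neq T_2\} = 4$.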

Proof. To simplify the notations, and since in this proof the trees $T_1,T_2$ and the index $p$
are fixed, we shall write $C(a,b)$ to denote $C^p_{T_1,T_2}(a,b)$.

The cases $n = 1$ to 5 can be checked 'by hand' through the computation of the
distances between all pairs of trees in $\mathcal{NT}_n$. In the case $n = 1$, there is only one tree in
$\mathcal{NT}_1$, and, as we mentioned after Lemma 4, $\mathcal{NT}_2$ consists only of three trees $T_1,T_2,T_3$,
with Newick strings (1,2);, ((1)2);, and ((2)1);, respectively, and it can be seen that
$d^\ell_p(T_1,T_2)^p = d^\ell_p(T_1,T_3)^p = 1$, $d^\ell_p(T_2,T_3)^p = 2$. As far as the cases $n = 3, 4, 5$ go, the files
{3,4,5}-tree-nt-pairs.dat available at the Supplementary Material web page contain
the values of $d^\ell_p(T_1,T_2)^p$ for each (unordered) pair of trees $\{T_1,T_2\}$ in the corresponding
$\mathcal{NT}_n$.

Now, for $n \geq 5$, we shall prove by induction on $n$ that $d^\ell_p(T_1,T_2)^p \geq 4$ for every pair
of different trees $T_1,T_2 \in \mathcal{NT}_n$. Since it is easy to produce pairs of trees $T_1,T_2 \in \mathcal{NT}_n$
such that $d^\ell_p(T_1,T_2)^p = 4$, like for instance those depicted in Fig. 5, this will finish the
proof of the statement.
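To make the existence claim concrete, here is one pair at distance 4 for $n = 6$, checked numerically. The pair is our own illustration and is not claimed to be the one drawn in Fig. 5: in Newick-like notation the trees are (1,(3)2,4,5,6); and (2,(3)1,4,5,6);, that is, taxon 3 is nested below taxon 2 in the first tree and below taxon 1 in the second. The child-to-parent encoding and the helper names are assumptions as well.

```python
def ancestors(parent, v):
    """Return [v, parent(v), ..., root]."""
    chain = [v]
    while parent[chain[-1]] is not None:
        chain.append(parent[chain[-1]])
    return chain

def splitted_matrix(parent, taxa):
    """Entry (i, j): number of arcs from the least common ancestor of i and j down to i."""
    n = len(taxa)
    ell = [[0] * n for _ in range(n)]
    for r, a in enumerate(taxa):
        up_a = ancestors(parent, a)
        for c, b in enumerate(taxa):
            if a != b:
                up_b = set(ancestors(parent, b))
                ell[r][c] = next(k for k, v in enumerate(up_a) if v in up_b)
    return ell

def power_sum(m1, m2, p):
    """p-th power of the L^p distance between two matrices."""
    return sum(abs(x - y) ** p for r1, r2 in zip(m1, m2) for x, y in zip(r1, r2))

TAXA = [1, 2, 3, 4, 5, 6]
# Unlabeled root r; taxon 3 hangs from taxon 2 in TA and from taxon 1 in TB.
TA = {"r": None, 1: "r", 2: "r", 3: 2, 4: "r", 5: "r", 6: "r"}
TB = {"r": None, 1: "r", 2: "r", 3: 1, 4: "r", 5: "r", 6: "r"}

MA, MB = splitted_matrix(TA, TAXA), splitted_matrix(TB, TAXA)
print([power_sum(MA, MB, p) for p in (1, 2, 3, 4)])
```

Only the four entries indexed by (1,3), (3,1), (2,3) and (3,2) change, each by exactly 1, which is why the power sum does not depend on $p$.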

The starting point for the induction procedure is $n = 5$: we know (by direct inspection
of the file 5-tree-nt-pairs.dat) that $d^\ell_p(T_1,T_2)^p \geq 4$ for every pair of different trees
$T_1,T_2 \in \mathcal{NT}_5$. Assume now that this inequality holds for every two trees in $\mathcal{NT}_n$, for
some $n \geq 5$, and let us prove it for $\mathcal{NT}_{n+1}$.

So, let $T_1,T_2 \in \mathcal{NT}_{n+1}$ be a pair of different trees. As in the last proof, we shall write
$C(a,b)$ to denote $C^p_{T_1,T_2}(a,b)$.

Without any loss of generality, we assume that $n+1$ is a leaf of largest depth in $T_1$.
By Lemma 5, if $n+1$ is not a deepest leaf of $T_2$, then $d^\ell_p(T_1,T_2)^p \geq 5$. So, in the rest of
the proof we assume that $n+1$ is also a deepest leaf of $T_2$. In particular, in both trees,
the siblings of $n+1$ (if they exist) are also deepest leaves.

We distinguish now two main cases, each one divided in several subcases. 

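(a) Assume that the parent of $n+1$ in $T_1$ is labeled, say with $n$. This implies that

$$\begin{aligned} &\ell_{T_1}(n, n+1) = 0,\quad \ell_{T_1}(n+1, n) = 1\\ &\ell_{T_1}(n+1, a) = \ell_{T_1}(n, a) + 1, \text{ for every } a \in S \setminus \{n, n+1\}\\ &\ell_{T_1}(a, n+1) = \ell_{T_1}(a, n), \text{ for every } a \in S \setminus \{n, n+1\} \end{aligned}$$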
We distinguish the following subcases.
(a.1) Assume that, in $T_2$, the node $n$ is an ancestor of $n+1$, but not its parent. In this
case, $\ell_{T_2}(n+1, n) > 1$, and therefore

$$C(n+1, n) \geq 1.$$

Now, let $a \in S \setminus \{n, n+1\}$. Let us see that $a$ contributes at least 1 to $d^\ell_p(T_1,T_2)^p$.

- If $n > [a, n+1]_{T_2}$ (that is, if $a$ is a descendant of an intermediate node in the
path $n \rightsquigarrow n+1$), then $\ell_{T_2}(a, n+1) < \ell_{T_2}(a, n)$ and therefore, since $\ell_{T_1}(a, n+1) =
\ell_{T_1}(a, n)$, it must happen that $\ell_{T_1}(a, n+1) \neq \ell_{T_2}(a, n+1)$ or $\ell_{T_1}(a, n) \neq \ell_{T_2}(a, n)$,
which implies that

$$C(a, n) + C(a, n+1) \geq 1.$$

- If $n \leq [a, n+1]_{T_2}$ in $T_2$, then

$$\ell_{T_2}(n+1, a) = \ell_{T_2}(n, a) + \ell_{T_2}(n+1, n) > \ell_{T_2}(n, a) + 1,$$

and therefore, since $\ell_{T_1}(n+1, a) = \ell_{T_1}(n, a) + 1$, it must happen that $\ell_{T_1}(n+1, a) \neq
\ell_{T_2}(n+1, a)$ or $\ell_{T_1}(n, a) \neq \ell_{T_2}(n, a)$, and hence

$$C(n, a) + C(n+1, a) \geq 1.$$

Since there are at least 4 taxa other than $n$ and $n+1$, and their contributions add
up to $C(n+1, n)$, we conclude that, in this case, $d^\ell_p(T_1,T_2)^p \geq 5$.
(a.2) Assume that, in $T_2$, the node $n$ is not an ancestor of $n+1$; set

$$\ell_{T_2}(n, n+1) = x \geq 1, \quad \ell_{T_2}(n+1, n) = y \geq 1.$$

If $x \geq y$, then $\mathrm{depth}_{T_2}(n) \geq \mathrm{depth}_{T_2}(n+1)$ and thus, since $n+1$ was a deepest leaf
of $T_2$, $n$ would also be a deepest leaf of $T_2$. But $n$ is not a deepest leaf of $T_1$ and
therefore, in this case, we already know by Lemma 5 that $d^\ell_p(T_1,T_2)^p \geq 5$.
Assume now that $x < y$. Then, $y \geq 2$ and thus, on the one hand,

$$C(n+1, n) + C(n, n+1) = (y-1)^p + x^p \geq 2$$

and, on the other hand, the path $[n+1, n]_{T_2} \rightsquigarrow n+1$ has at least one intermediate
node: let $a_0 \neq n+1$ be a labeled node that is a descendant of the parent of $n+1$
(notice that, in this case, $a_0$ is either the parent of $n+1$ or its sibling). Then,

$$\ell_{T_2}(a_0, n+1) < \ell_{T_2}(a_0, n), \quad \ell_{T_2}(n+1, a_0) = 1 \leq \ell_{T_2}(n, a_0)$$

imply that

$$C(a_0, n+1) + C(a_0, n) \geq 1, \quad C(n+1, a_0) + C(n, a_0) \geq 1.$$

So, in this case, $d^\ell_p(T_1,T_2)^p \geq 4$.
(a.3) Assume that, in $T_2$, the node $n+1$ is a leaf and its parent is $n$. Let $T_1^*, T_2^* \in \mathcal{NT}_n$ be
the trees obtained from $T_1$ and $T_2$, respectively, by removing the leaf $n+1$ together
with its pendant arc. After this operation, we have that, for every $1 \leq a \neq b \leq n$ and for every $i = 1,2$,
$\ell_{T_i^*}(a,b) = \ell_{T_i}(a,b)$ and therefore, $C(a,b) = C^p_{T_1^*,T_2^*}(a,b)$. Then,

$$d^\ell_p(T_1,T_2)^p \geq d^\ell_p(T_1^*,T_2^*)^p \geq 4,$$

the last inequality being given by the induction hypothesis.

(b) Assume now that the parent of $n+1$ in $T_1$ is not labeled. Therefore, $n+1$ must have at
least one sibling, which, we recall, is a leaf. Without any loss of generality we assume
that $n$ is a sibling of $n+1$. In this case, we have that

$$\begin{aligned} &\ell_{T_1}(n, n+1) = \ell_{T_1}(n+1, n) = 1\\ &\ell_{T_1}(n+1, a) = \ell_{T_1}(n, a) > 0, \text{ for every } a \in S \setminus \{n, n+1\}\\ &\ell_{T_1}(a, n+1) = \ell_{T_1}(a, n), \text{ for every } a \in S \setminus \{n, n+1\} \end{aligned}$$

Notice moreover that $n$ is also a deepest leaf in $T_1$ and therefore, by Lemma 5, if it is
not a deepest leaf in $T_2$, then $d^\ell_p(T_1,T_2)^p \geq 5$. So, we assume henceforth that $n$ and
$n+1$ are deepest leaves in $T_2$. As in (a), there are several subcases to discuss.
(b.1) Assume that, in $T_2$, the leaves $n$ and $n+1$ are not siblings. In this case,

$$\ell_{T_2}(n, n+1) = x \geq 1, \quad \ell_{T_2}(n+1, n) = y \geq 1$$

and $x > 1$ or $y > 1$. Since the depths of $n$ and $n+1$ in $T_2$ are the same, it must
happen that $x = y$. Then,

$$C(n, n+1) + C(n+1, n) = (x-1)^p + (x-1)^p \geq 2.$$

Let now $a_0 \neq n$ be a labeled node that is a descendant of the parent of
$n$ in $T_2$: notice that this parent is an intermediate node in the path $[n, n+1]_{T_2} \rightsquigarrow n$.
Then,

$$\ell_{T_2}(n, a_0) = 1 < x = \ell_{T_2}(n+1, a_0), \quad \ell_{T_2}(a_0, n) < \ell_{T_2}(a_0, n+1)$$

imply that $a_0$ contributes at least 2 to $d^\ell_p(T_1,T_2)^p$, and therefore that $d^\ell_p(T_1,T_2)^p \geq 4$.
Actually, $d^\ell_p(T_1,T_2)^p \geq 6$, because any labeled node $b_0 \neq n+1$ that is a descendant
of the parent of $n+1$ in $T_2$ will also contribute at least 2 to $d^\ell_p(T_1,T_2)^p$.

(b.2) Assume that, in $T_2$, the leaves $n$ and $n+1$ are siblings and their parent is labeled,
say with 1. In this case, by (a) (applied interchanging the roles of $T_1$ and $T_2$ and the
roles of $n$ and 1), we already know that $d^\ell_p(T_1,T_2)^p \geq 4$.

(b.3) Assume that, in $T_2$, the nodes $n$ and $n+1$ are sibling leaves and their parent is
not labeled. In this case, let $T_1^*,T_2^* \in \mathcal{NT}_n$ be the trees obtained from $T_1$ and $T_2$,
respectively, by removing the leaves $n$ and $n+1$ together with their pendant arcs,
and labeling with $n$ the former parent of $n$ and $n+1$. In this way we have that, for
every $1 \leq a \neq b \leq n$ and for every $i = 1,2$,

$$\begin{aligned} \ell_{T_i^*}(a,b) &= \ell_{T_i}(a,b) && \text{if } a \neq n\\ \ell_{T_i^*}(n,b) &= \ell_{T_i}(n,b) - 1 && \text{if } a = n \end{aligned}$$

and therefore, $C(a,b) = C^p_{T_1^*,T_2^*}(a,b)$. Then, arguing as in (a.3),

$$d^\ell_p(T_1,T_2)^p \geq d^\ell_p(T_1^*,T_2^*)^p \geq 4.$$

This finishes the proof by induction. □

Remark 4. Following in detail the arguments developed in the last theorem until their
last consequences, it can be proved that, for $n \geq 6$, the pairs of trees $T_1, T_2$ in $\mathcal{NT}_n$ such
that $d^\ell_p(T_1,T_2)^p = 4$, for every $p \in \mathbb{N}^+$, are exactly those pairs such that $d^\ell_1(T_1,T_2) = 4$,
and they have the following form. Let $i_1,i_2,i_3$ be any three taxa in $S$ and let $T_0$ be any
non-weighted rooted tree with some of its nodes, including all its elementary nodes and
all its leaves except at most one elementary node or one leaf, labeled in $S \setminus \{i_1,i_2,i_3\}$.
Then, $T_1$ and $T_2$ are obtained, respectively, by attaching to $T_0$ at the same node the
'basic' trees $T_1'$ and $T_2'$ or $T_1''$ and $T_2''$ in Fig. 6. The attachment of one of these trees
at a node $v$ in $T_0$ is carried out by identifying the node with the root of the tree, and
in such a way that the resulting trees $T_1$ and $T_2$ have all their leaves and elementary
nodes labeled. This implies that if $T_0$ had some non-labeled leaf or elementary node, this
is necessarily the node where the basic trees must be attached, and that (since $T_2''$ has
its root elementary), the basic pair $T_1'', T_2''$ cannot be attached to a non-labeled leaf (this
would create an elementary node in $T_2$).

For instance, the trees $T$ and $T'$ in Fig. 5 are obtained by attaching the basic trees
$T_1'$ and $T_2'$ (with $i_1 = 1$, $i_2 = 2$, and $i_3 = 3$) to the tree with Newick code (4,...,n);.




Fig. 6. The pairs of basic trees that give rise, when attached to the same place in a tree, to pairs of
non-weighted phylogenetic trees at $d^\ell_p$ distance $\sqrt[p]{4}$.



Remark 5. It can be checked that the pairs of different trees in $\mathcal{NT}_n$ at least distance
for $d^\ell_1$ always have splitted path lengths matrices with $n - 1$ (if $n \leq 5$) or 4 (if $n \geq 6$)
entries that differ by only 1. This implies that the least non-zero value for $d^\ell_\infty$ on $\mathcal{NT}_n$
is always 1, and that the least non-zero value for $d^\ell_0$ on $\mathcal{NT}_n$ is again $n - 1$ for $n \leq 5$
and 4 for $n \geq 6$.

Unfortunately, we have not been able to find a formula for the diameter of $\mathcal{NT}_n$ with
respect to any metric $d^\ell_p$ with $p \in \mathbb{N}^+$. Actually, and to our knowledge, the diameter of
the space of non-weighted binary phylogenetic trees with respect to the nodal metrics $d_1$
and $d_2$ is still not known, either. Not knowing a formula for the diameter, we are not able
to give an explicit description of the distribution of distances for any $p$, either. In the file
distributions.pdf in the Supplementary Material we provide the distributions of $d^\ell_1$
and $(d^\ell_2)^2$ (that is, of $d^\ell_2$ squared) on $\mathcal{NT}_n$ for $n = 3,4,5,6$, as well as the distributions
of the values of $d^\ell_1$ and $(d^\ell_2)^2$ applied to pairs of trees in TreeBASE sharing $n = 2$ to 6
labels.



7 Conclusions 

Some classical metrics for phylogenetic trees are based on the comparison of the representations of rooted phylogenetic trees as vectors of path lengths between pairs of
labeled nodes. But these metrics only separate non-weighted binary rooted trees: two
more general non-isomorphic rooted phylogenetic trees can have the same such vectors
of path lengths, and therefore be at zero distance for these metrics. In this paper we have
overcome this problem by representing a rooted phylogenetic tree by means of a matrix
with rows and columns indexed by taxa and where every entry $(i,j)$ is the distance from
the least common ancestor of the pair of nodes labeled with $i$ and $j$ to the node labeled
with $i$. We call these matrices splitted path lengths matrices, because they split in two
terms the path length between every pair of labeled nodes. These matrices define an injective mapping from the space $\mathcal{T}_n$ of all $\mathbb{R}_{>0}$-weighted rooted phylogenetic trees with $n$
labeled nodes and possibly nested taxa into the set $\mathcal{M}_n(\mathbb{R})$ of $n \times n$ real-valued matrices.
Therefore, any norm on $\mathcal{M}_n(\mathbb{R})$ applied to the difference of the splitted path lengths
matrices of trees defines a metric on $\mathcal{T}_n$. Using the well-known $L^p$ norms on $\mathcal{M}_n(\mathbb{R})$, for
$p \in \mathbb{N} \cup \{\infty\}$, we obtain the family of splitted nodal metrics $d^\ell_p$ on $\mathcal{T}_n$,

$$d^\ell_p(T_1,T_2) = \begin{cases} |\{(i,j) \mid 1 \leq i \neq j \leq n,\ \ell_{T_1}(i,j) \neq \ell_{T_2}(i,j)\}| & \text{if } p = 0\\ \sqrt[p]{\sum_{1 \leq i \neq j \leq n} |\ell_{T_1}(i,j) - \ell_{T_2}(i,j)|^p} & \text{if } p \in \mathbb{N}^+\\ \max\{|\ell_{T_1}(i,j) - \ell_{T_2}(i,j)| \mid 1 \leq i \neq j \leq n\} & \text{if } p = \infty \end{cases}$$



We have proved several properties for these metrics $d^\ell_p$ on the subspace $\mathcal{NT}_n$ of
non-weighted rooted phylogenetic trees possibly with nested taxa. For instance, we have
established the least distance between any pair of such trees. It remains as an open
problem to find the diameter of $\mathcal{NT}_n$ with respect to these metrics, and the distribution of
their values. Actually, these problems also remain open for the classical nodal distances on
non-weighted binary (rooted as well as unrooted) trees. These are interesting problems:
knowing the largest value reached by a metric is necessary to normalize the metric between
0 and 1, while knowing the distribution of the values allows one to answer the question
of whether two trees are more similar than expected by chance [19]. We hope to report
on these problems in the near future.

We cannot advocate the use of any splitted nodal metric $d^\ell_p$ over the other ones
except, perhaps, warning against the use of

$$d^\ell_0(T_1,T_2) = |\{(i,j) \in S^2 \mid \ell_{T_1}(i,j) \neq \ell_{T_2}(i,j)\}|$$

$$d^\ell_\infty(T_1,T_2) = \max\{|\ell_{T_1}(i,j) - \ell_{T_2}(i,j)| \mid (i,j) \in S^2\}$$

because they are too uninformative. Since the most popular norms on $\mathbb{R}^m$ are the Manhattan and the Euclidean, it seems natural to use $d^\ell_1$ and $d^\ell_2$, as it has been the case
in the classical, non-weighted binary setting. Each one has its advantages. For instance,
the computation of $d^\ell_1$ does not involve square roots, and therefore it can be computed
exactly and, if the weights are integer numbers, the resulting value is an integer number.
Moreover, it is well known that, for every $p \in \mathbb{N}^+$,

$$\|x\|_p \leq \|x\|_1 \text{ for every } x \in \mathbb{R}^m$$

and therefore,

$$d^\ell_p(T_1,T_2) \leq d^\ell_1(T_1,T_2) \text{ for every } T_1,T_2 \in \mathcal{T}_n.$$
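For completeness, the norm inequality quoted above follows from a one-line expansion. The derivation below is a standard argument supplied here as a reminder, not taken from the paper: expanding the $p$-th power of a sum of non-negative numbers produces every $|x_i|^p$ plus non-negative cross terms.

```latex
\[
\|x\|_p^p \;=\; \sum_{i=1}^{m} |x_i|^p
\;\le\; \Big(\sum_{i=1}^{m} |x_i|\Big)^{p}
\;=\; \|x\|_1^p ,
\qquad\text{hence}\qquad \|x\|_p \le \|x\|_1 .
\]
```

For example, for $x = (1,1,1,1)$ one gets $\|x\|_2 = 2 \leq 4 = \|x\|_1$, which matches the situation of four matrix entries differing by 1 described in Remark 5.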

On the other hand, the comparison of splitted path lengths matrices by means of the
Euclidean norm enables the use of many geometric and clustering methods that are not
available otherwise. For instance, the specific properties of the Euclidean norm allowed
Steel and Penny to compute explicitly the mean value of the nodal distance $d_2$ on the
class of non-weighted unrooted binary trees [29], while no similar result is known for $d_1$.

As a rule of thumb, we consider it suitable to use $d^\ell_1$ when the trees are non-weighted
(or when they have integer weights), because these trees can be seen as discrete objects
and thus their comparison through a discrete tool such as the Manhattan norm seems appropriate. When the trees have arbitrary positive real weights, they should be understood
as belonging to a continuous space [5], and then the Euclidean norm is more appropriate.

Supplementary Material 

The Supplementary Material referenced in the paper is available at
http://bioinfo.uib.es/~recerca/phylotrees/nodal/.